Concatenate multiple files row-wise into a single dataframe with the same header names - dataframe

I have 400 CSV files, and each file contains a single column with 4667 rows. Every row has a name and a corresponding value, for example "A=54", "B=56", and so on for all 4667 rows. My problem statement is:
1. Fetch the variable names and put them in separate columns.
2. Fetch the corresponding variable values and put them in the row beneath those columns.
3. Do this for all 400 files and append each file's values as an additional row, which makes 400 rows.
I have done this for a single file, but I don't know how to do it for multiple files.
import glob
from collections import OrderedDict
import pandas as pd

path = r'Github/dataset/Raw_Dataset/'
filenames = glob.glob(path + "/*.csv")
# Read each file as a one-column frame, keyed by its file name.
dict_of_df = OrderedDict((f, pd.read_csv(f, header=None, names=['Devices'])) for f in filenames)
eda = pd.concat(dict_of_df)

Do you mean a concatenation? You could use pandas like this:
import pandas as pd
df1 = pd.DataFrame(columns=["A","B","C","D"],data=[["1","2","3","4"]])
df2 = pd.DataFrame(columns=["A","B","C","D"],data=[["5","6","7","8"]])
df3 = pd.DataFrame(columns=["A","B","C","D"],data=[["9","10","11","12"]])
df_concat = pd.concat([df1, df2, df3])
print(df_concat)
In combination with a loop over the files in the folder ending with ".csv", it would be:
import os
import pandas as pd

path = r"C:\YOUR\DIRECTORY"
df_concat = pd.DataFrame()
for filename in os.listdir(path):
    if filename.endswith(".csv"):
        print(filename)
        df_temp = pd.read_csv(os.path.join(path, filename))
        df_concat = pd.concat([df_concat, df_temp])
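Since each row in the question's files is a name=value pair like "A=54", the concatenation alone will not produce the wide layout asked for. A sketch of one way to get there, assuming the "A=54" format and the path from the question:

```python
import glob
import pandas as pd

# Path from the question; adjust to your dataset location.
path = r"Github/dataset/Raw_Dataset"
rows = []
for f in glob.glob(path + "/*.csv"):
    s = pd.read_csv(f, header=None, names=["Devices"])["Devices"]
    # Split "A=54" into name and value: the names become the columns.
    pairs = s.str.split("=", n=1, expand=True)
    rows.append(pd.Series(pairs[1].values, index=pairs[0].values, name=f))

eda = pd.DataFrame(rows)  # one row per file, one column per variable
```

Converting the values to numbers afterwards (e.g. `eda.astype(float)`) would be a natural follow-up.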

Related

assigning csv file to a variable name

I have a .csv file and I use pandas to read it.
import pandas as pd

data = pd.read_csv('input.csv')
print(data)
0 1 2 3 4 5
0 -3.288733e-08 2.905263e-08 2.297046e-08 2.052534e-08 3.767194e-08 4.822049e-08
1 2.345769e-07 9.462636e-08 4.331173e-08 3.137627e-08 4.680112e-08 6.067109e-08
2 -1.386798e-07 1.637338e-08 4.077676e-08 3.339685e-08 5.020153e-08 5.871679e-08
3 -4.234607e-08 3.555008e-08 2.563824e-08 2.320405e-08 4.008257e-08 3.901410e-08
4 3.899913e-08 5.368551e-08 3.713510e-08 2.367323e-08 3.172775e-08 4.799337e-08
My aim is to assign the file to a column name so that I can access the data at a later time, for example by doing something like
new_data = df['filename']
filename
0 -3.288733e-08,2.905263e-08,2.297046e-08,2.052534e-08,3.767194e-08,4.822049e-08
1 2.345769e-07,9.462636e-08,4.331173e-08,3.137627e-08,4.680112e-08, 6.067109e-08
2 -1.386798e-07,1.637338e-08,4.077676e-08,3.339685e-08,5.020153e-08,5.871679e-08
3 -4.234607e-08,3.555008e-08,2.563824e-08,2.320405e-08,4.008257e-08,3.901410e-08
4 3.899913e-08,5.368551e-08,3.713510e-08,2.367323e-08,3.172775e-08,4.799337e-08
I don't really like it (and I still don't completely get the point), but you could just read your data in as one column (by using a 'wrong' separator) and rename the column.
import pandas as pd

filename = 'input.csv'
df = pd.read_csv(filename, sep=';')  # ';' never occurs, so each line stays in one column
df.columns = [filename]
If you then wish, you could add other files by doing the same thing (with a different dataframe name at first) and then concatenating them with df.
A more useful approach, IMHO, would be to add the dataframe to a dictionary (a list would also work).
import pandas as pd
filename = 'input.csv'
df = pd.read_csv(filename)
data_dict = {filename: df}
# ... Add multiple files to data_dict by repeating steps above in a loop
You can then access your data later on by calling data_dict[filename] or data_dict['input.csv'].
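The loop hinted at in the comment could look like this (a sketch; the `data` folder name is hypothetical):

```python
import glob
import os
import pandas as pd

data_dir = "data"  # hypothetical folder holding the CSV files
data_dict = {}
for path in glob.glob(os.path.join(data_dir, "*.csv")):
    # Key each dataframe by its base file name for later lookup.
    data_dict[os.path.basename(path)] = pd.read_csv(path)

# Later: data_dict["input.csv"] returns that file's dataframe.
```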

adding source.csv file name to first column while merging multiple csv

I have thousands of CSV files which I have to merge before analysis. I want to add a column holding the source file name, with each row getting a sub-number (for example SRR1313_1, SRR1313_2, SRR1313_3, and so on).
One limitation when merging huge data is RAM, which results in an error. Here are the commands I tried, but I can't add the sub-numbering after the file name:
import pandas as pd
import glob
import os

path = '/home/shoeb/Desktop/vergDB/DATA'
folder = glob.glob(os.path.join(path, "*.csv"))
out_file = "final_big_file.tsv"
first_file = True
for file in folder:
    df = pd.read_csv(file, skiprows=1)
    df = df.dropna()
    df.insert(loc=0, column="sequence_id", value=file[0:10])
    if first_file:
        df.to_csv(out_file, sep="\t", index=False)
        first_file = False
    else:
        df.to_csv(out_file, sep="\t", index=False, header=False, mode='a')
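For the per-row sub-numbering the question asks about (SRR1313_1, SRR1313_2, ...), one possible sketch is to build `sequence_id` from the file-name prefix plus the row position; the ten-character prefix is kept from the question's code:

```python
import glob
import os
import pandas as pd

data_dir = "/home/shoeb/Desktop/vergDB/DATA"  # path from the question
out_file = "final_big_file.tsv"
first_file = True
for file in glob.glob(os.path.join(data_dir, "*.csv")):
    df = pd.read_csv(file, skiprows=1).dropna()
    stem = os.path.basename(file)[0:10]  # file-name prefix, as in the question
    # One id per row: <stem>_1, <stem>_2, ... restarting in each file.
    df.insert(loc=0, column="sequence_id",
              value=[f"{stem}_{i}" for i in range(1, len(df) + 1)])
    # Appending file by file keeps only one frame in memory at a time.
    df.to_csv(out_file, sep="\t", index=False,
              header=first_file, mode="w" if first_file else "a")
    first_file = False
```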

How can I sort and replace the rank of each item instead of its value before sorting in a merged final csv?

I have 30 different CSV files, and each row begins with a date followed by some similar features measured daily for each of the 30 items. The value of each feature is not important, but the rank it gains after sorting within each day is. How can I get one merged CSV from the 30 separate CSVs, with the rank of each feature?
If your files are all in the same format, you can loop over them and consolidate them into a single data frame. From there you can manipulate it as needed:
import pandas as pd
import glob

path = r'C:\path_to_files\files'  # use your path
all_files = glob.glob(path + "/*.csv")
li = []
for filename in all_files:
    df = pd.read_csv(filename, index_col=None, header=0)
    li.append(df)
frame = pd.concat(li, axis=0, ignore_index=True)
Other examples of the same thing: Import multiple csv files into pandas and concatenate into one DataFrame.
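Since the question is about ranks rather than raw values, one hedged sketch of the follow-up step (the column names `date` and `feature` are hypothetical) is to rank within each day after the concat:

```python
import pandas as pd

# Hypothetical merged frame: one row per item per day.
frame = pd.DataFrame({
    "date": ["2021-01-01"] * 3 + ["2021-01-02"] * 3,
    "feature": [5.0, 1.0, 3.0, 2.0, 9.0, 4.0],
})
# Replace each value by its rank within that day (1 = smallest).
frame["feature_rank"] = frame.groupby("date")["feature"].rank()
```

The same pattern can be repeated per feature column, or applied to all feature columns at once with `frame.groupby("date").rank()`.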

Add column with filename wildcard

I have files that have the pattern
XXXX____________030621_120933_D.csv
YYYY____________030621_120933_E.csv
ZZZZ____________030621_120933_F.csv
I am using glob.glob and a for loop to parse each file into pandas and create a data frame, which I will merge at the end. I want to add a column that carries XXXX, YYYY, and ZZZZ in each data frame accordingly.
I can create the column called ID with df['ID'] and want to pick the value from the file names. What is the easiest way to grab that from the filename when reading the CSV and processing it via pandas?
If the file names are as you have presented them, then use this code:
dir_path = # path to your directory
file_paths = glob.glob(dir_path + '*.csv')
frames = []
for file_ in file_paths:
    df = pd.read_csv(file_)
    df['ID'] = file_[<index of the ID>]
    frames.append(df)
result = pd.concat(frames, ignore_index=True)
Finding the right index might take a bit of time, but that should do it.
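Given names like `XXXX____________030621_120933_D.csv`, splitting the base name on the first underscore may be more robust than a fixed character index; a small sketch (the function name is mine):

```python
import os

def id_from_filename(path):
    # "XXXX____________030621_120933_D.csv" -> "XXXX"
    return os.path.basename(path).split("_")[0]
```

Inside the loop this would be `df['ID'] = id_from_filename(file_)`.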

how to put first value in one column and remaining into other column?

ROCO2_CLEF_00001.jpg,C3277934,C0002978
ROCO2_CLEF_00002.jpg,C3265939,C0002942,C2357569
I want to make a pandas data frame from this CSV file.
I want to put the first entry of each row (the filename) into a column named "filenames", and the remaining entries into another column named "class". How do I do that?
In case your file doesn't have a fixed number of commas per row, you could do the following:
import pandas as pd

csv_path = 'test_csv.csv'
with open(csv_path) as f:
    raw_data = f.readlines()
# clean rows
raw_data = [x.strip().replace("'", "") for x in raw_data]
print(raw_data)
# split each row between the filename and the rest
raw_data = [[x.split(",")[0], ','.join(x.split(",")[1:])] for x in raw_data]
print(raw_data)
# build the pandas DataFrame
column_names = ["filenames", "class"]
temp_df = pd.DataFrame(data=raw_data, columns=column_names)
print(temp_df)
filenames class
0 ROCO2_CLEF_00001.jpg C3277934,C0002978
1 ROCO2_CLEF_00002.jpg C3265939,C0002942,C2357569
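As a variant, `str.split` with a maxsplit of 1 avoids re-joining the tail of each row; a sketch on the sample lines from the question:

```python
import pandas as pd

lines = [
    "ROCO2_CLEF_00001.jpg,C3277934,C0002978",
    "ROCO2_CLEF_00002.jpg,C3265939,C0002942,C2357569",
]
# Split each line once, at the first comma only.
rows = [line.split(",", 1) for line in lines]
df = pd.DataFrame(rows, columns=["filenames", "class"])
```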