Adding the source CSV file name to the first column while merging multiple CSVs - pandas

I have thousands of CSV files which I have to merge before analysis. I want to add a column holding the source file name, with each row getting a sub-number (for example SRR1313_1, SRR1313_2, SRR1313_3, and so on).
One limitation when merging huge data is RAM, which results in an error. Here is the code I tried, but I can't add the sub-numbering after the file name:
import pandas as pd
import glob
import os

path = '/home/shoeb/Desktop/vergDB/DATA'
folder = glob.glob(os.path.join(path, "*.csv"))  # glob inside the data folder
out_file = "final_big_file.tsv"

first_file = True
for file in folder:
    df = pd.read_csv(file, skiprows=1)
    df = df.dropna()
    # take the first 10 characters of the file name as the ID
    df.insert(loc=0, column="sequence_id", value=os.path.basename(file)[0:10])
    if first_file:
        df.to_csv(out_file, sep="\t", index=False)
        first_file = False
    else:
        df.to_csv(out_file, sep="\t", index=False, header=False, mode='a')
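To get the per-row sub-numbering (SRR1313_1, SRR1313_2, ...), one option is to build the IDs from each row's position within its file. A minimal sketch of the insert step, assuming the first 10 characters of the file name are the wanted prefix:

# inside the loop, after dropna(): number the rows of this file from 1
base = os.path.basename(file)[0:10]
ids = [f"{base}_{i}" for i in range(1, len(df) + 1)]
df.insert(loc=0, column="sequence_id", value=ids)

Because each file is still written out with mode='a', memory use stays limited to one file at a time.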

Related

assigning csv file to a variable name

I have a .csv file and I use pandas to read it:
import pandas as pd
from pandas import read_csv

data = read_csv('input.csv')
print(data)
0 1 2 3 4 5
0 -3.288733e-08 2.905263e-08 2.297046e-08 2.052534e-08 3.767194e-08 4.822049e-08
1 2.345769e-07 9.462636e-08 4.331173e-08 3.137627e-08 4.680112e-08 6.067109e-08
2 -1.386798e-07 1.637338e-08 4.077676e-08 3.339685e-08 5.020153e-08 5.871679e-08
3 -4.234607e-08 3.555008e-08 2.563824e-08 2.320405e-08 4.008257e-08 3.901410e-08
4 3.899913e-08 5.368551e-08 3.713510e-08 2.367323e-08 3.172775e-08 4.799337e-08
My aim is to assign the file to a column name so that I can access the data at a later time, for example by doing something like:
new_data = df['filename']
filename
0 -3.288733e-08,2.905263e-08,2.297046e-08,2.052534e-08,3.767194e-08,4.822049e-08
1 2.345769e-07,9.462636e-08,4.331173e-08,3.137627e-08,4.680112e-08, 6.067109e-08
2 -1.386798e-07,1.637338e-08,4.077676e-08,3.339685e-08,5.020153e-08,5.871679e-08
3 -4.234607e-08,3.555008e-08,2.563824e-08,2.320405e-08,4.008257e-08,3.901410e-08
4 3.899913e-08,5.368551e-08,3.713510e-08,2.367323e-08,3.172775e-08,4.799337e-08
I don't really like it (and I still don't completely get the point), but you could just read in your data as one column (by using a 'wrong' separator) and rename the column.
import pandas as pd
filename = 'input.csv'
df = pd.read_csv(filename, sep=';')
df.columns = [filename]
If you then wish, you could add other files by doing the same thing (with a different name for df at first) and then concatenate that with df.
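For example, the same trick applied to a second file (filename2 is hypothetical here) and joined column-wise:

import pandas as pd

filename = 'input.csv'
filename2 = 'input2.csv'  # hypothetical second file
df = pd.read_csv(filename, sep=';')
df.columns = [filename]
df2 = pd.read_csv(filename2, sep=';')
df2.columns = [filename2]
combined = pd.concat([df, df2], axis=1)  # one column per file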
A more useful approach IMHO would be to add the DataFrame to a dictionary (a list would also be possible):
import pandas as pd
filename = 'input.csv'
df = pd.read_csv(filename)
data_dict = {filename: df}
# ... Add multiple files to data_dict by repeating steps above in a loop
You can then access your data later on by calling data_dict[filename] or data_dict['input.csv'].
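A sketch of that loop, assuming all files matching *.csv in the working directory should be collected:

import glob
import pandas as pd

data_dict = {}
for filename in glob.glob('*.csv'):
    data_dict[filename] = pd.read_csv(filename)

new_data = data_dict['input.csv']  # retrieve any file's data by name later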

How can I sort and replace the rank of each item instead of its value before sorting in a merged final csv?

I have 30 different CSV files; each row begins with a date, followed by the same set of features measured daily for each of the 30 items. The value of each feature is not important, but the rank it gains after sorting within each day is important. How can I get one merged CSV from the 30 separate CSVs with the rank of each feature?
If your files are all the same format, you can do a loop and consolidate them into a single data frame. From there you can manipulate as needed:
import pandas as pd
import glob

path = r'C:\path_to_files\files'  # use your path
all_files = glob.glob(path + "/*.csv")

li = []
for filename in all_files:
    df = pd.read_csv(filename, index_col=None, header=0)
    li.append(df)

frame = pd.concat(li, axis=0, ignore_index=True)
Other examples of the same thing: Import multiple csv files into pandas and concatenate into one DataFrame.
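Since the question asks for ranks rather than raw values, one possible follow-up step (a sketch that assumes a 'date' column and that every other column is a numeric feature; adjust both to your data) is to rank within each day after concatenating:

# rank every feature column within each day's group of rows
feature_cols = [c for c in frame.columns if c != 'date']
ranks = frame.copy()
ranks[feature_cols] = frame.groupby('date')[feature_cols].rank(ascending=False)
ranks.to_csv('merged_ranks.csv', index=False)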

Selecting multiple excel files in different folders and comparing them with pandas

I am trying to compare multiple Excel files with pandas and glob, but I am getting the following output from my code:
Empty DataFrame
Columns: []
Index: []
This is my code:
import glob
import pandas as pd

all_data = pd.DataFrame()
path = 'S:\data\**\C_76_00_a?.xlsx'
for f in glob.glob(path):
    df = pd.read_excel(f, sheet_name=None, ignore_index=True, skiprows=10, usecols=4)
    cdf = pd.concat(df.values(), axis=1)
    all_data = all_data.append(cdf, ignore_index=True)
print(all_data)
I am using a Jupyter notebook; the folder structure is \2021\12\2021-30-12\ and the file names contain "C_76_00_a" + something unknown + ".xlsx".
Below is a screenshot of the type of document I am trying to compare.
[screenshot not reproduced]
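One likely cause of the empty result (an assumption; the question does not confirm it) is that glob.glob only expands the ** wildcard when recursive=True is passed, so no files are matched at all. A minimal sketch with that fixed, and with the read_excel arguments pared down to ones that are valid (ignore_index is not a read_excel parameter):

import glob
import pandas as pd

path = r'S:\data\**\C_76_00_a?.xlsx'
frames = []
for f in glob.glob(path, recursive=True):  # recursive=True is required for **
    # sheet_name=None returns a dict of DataFrames, one per sheet
    sheets = pd.read_excel(f, sheet_name=None, skiprows=10)
    frames.append(pd.concat(sheets.values(), axis=1))

all_data = pd.concat(frames, ignore_index=True) if frames else pd.DataFrame()
print(all_data)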

How to add/edit text in pandas.io.parsers.TextFileReader

I have a large CSV file. Since it is a large file (almost 7 GB), it cannot be converted into a pandas DataFrame all at once:
import pandas as pd

df1 = pd.read_csv('tblViewPromotionDataVolume_202004070600.csv', sep='\t', iterator=True, chunksize=1000)
for chunk in df1:
    print(chunk)
df1 is of type pandas.io.parsers.TextFileReader
Now I want to edit/add/insert some text (a new row) into this file and convert it back to a pandas DataFrame. Please let me know of possible solutions. Thanks in advance.
Each chunk here is a DataFrame, so do your processing on it inside the loop; then write it to the output file with DataFrame.to_csv using mode='a' for append mode:
import pandas as pd
import os

infile = 'tblViewPromotionDataVolume_202004070600.csv'
outfile = 'out.csv'

df1 = pd.read_csv(infile, sep='\t', iterator=True, chunksize=1000)
for chunk in df1:
    print(chunk)
    # processing with chunk
    # https://stackoverflow.com/a/30991707/2901002
    if not os.path.isfile(outfile):
        # if the file does not exist, write the header with the first chunk
        chunk.to_csv(outfile, sep='\t')
    else:
        # else it exists, so append without writing the header
        chunk.to_csv(outfile, sep='\t', mode='a', header=False)
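As an example of the editing step, a sketch of inserting one new row into a chunk before it is written (the column names here are hypothetical; use the real ones from your file):

# append one illustrative row to the current chunk
new_row = pd.DataFrame([{'col_a': 'some text', 'col_b': 123}])  # hypothetical columns
chunk = pd.concat([chunk, new_row], ignore_index=True)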

Concatenate multiple files rowwise into a single dataframe with the same header name

I have 400 CSV files, and every file contains a single column with 4667 rows. Each row holds a name and a corresponding value, for example "A=54", "B=56", and so on up to 4667 rows. My problem statement is:
1. Fetch the variable names and put them in separate columns.
2. Fetch the corresponding variable values and put them in the row beneath those columns.
3. Do this for all 400 files and append the corresponding values as rows, which makes 400 rows.
I have done it for a single file, but I don't know how to do it for multiple files:
import glob
from collections import OrderedDict
import pandas as pd

path = r'Github/dataset/Raw_Dataset/'
filenames = glob.glob(path + "/*.csv")
dict_of_df = OrderedDict((f, pd.read_csv(f, header=None, names=['Devices'])) for f in filenames)
eda = pd.concat(dict_of_df)
Do you mean a concatenation? You could use pandas like this:
import pandas as pd
df1 = pd.DataFrame(columns=["A","B","C","D"],data=[["1","2","3","4"]])
df2 = pd.DataFrame(columns=["A","B","C","D"],data=[["5","6","7","8"]])
df3 = pd.DataFrame(columns=["A","B","C","D"],data=[["9","10","11","12"]])
df_concat = pd.concat([df1, df2, df3])
print(df_concat)
In combination with a loop through the files in the folder ending with ".csv", it would be:
import os
import pandas as pd

path = r"C:\YOUR\DIRECTORY"
df_concat = pd.DataFrame()
for filename in os.listdir(path):
    if filename.endswith(".csv"):
        print(filename)
        df_temp = pd.read_csv(path + "\\" + filename)
        df_concat = pd.concat([df_concat, df_temp])
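The concatenation above stacks the raw single-column files; to actually get one row per file with the variable names as columns, a sketch (assuming every line really has the form name=value, and keeping the values as strings; cast with astype if they are numeric) could split each line on '=' and build the wide table:

import glob
import pandas as pd

path = r'Github/dataset/Raw_Dataset/'
rows = []
for f in glob.glob(path + "/*.csv"):
    s = pd.read_csv(f, header=None, names=['Devices'])['Devices']
    # split "A=54" into a name part and a value part
    pairs = s.str.split('=', n=1, expand=True)
    rows.append(pd.Series(pairs[1].values, index=pairs[0].values, name=f))

wide = pd.DataFrame(rows)  # 400 rows, one per file; 4667 columns A, B, ...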