Creating new dataframe by search result in df - pandas

I am reading a txt file to get a search variable. I am using this variable to find matching rows in a dataframe:
for lines in lines_list:
    sn = lines
    if sn in df[df['SERIAL'].str.contains(sn)]:
        condition = df[df['SERIAL'].str.contains(sn)]
        df_new = pd.DataFrame(condition)
        df_new.to_csv('try.csv', mode='a', sep=',', index=False)
When I check the try.csv file, it has many more lines than the txt file.
The df has a lot of lines, many more than the txt file.
I want to save the whole matching row from the search result into a dataframe or file.
I tried to append the search result to a new dataframe or csv.

First, create the list of lines:
import numpy as np
import pandas as pd

f = open("text.txt", "r")
l = list(map(lambda x: x.strip(), f.readlines()))
Then write an apply function that compares values and does the filtering:
def apply_func(x):
    # keep the value when it appears in the list, otherwise flag the row
    if str(x) in l:
        return x
    return np.nan
and get the output:
df["Serial"] = df["Serial"].apply(apply_func)
df.dropna(inplace=True)
df.to_csv("new_df.csv", mode="a", index=False)
Or try the filter method:
f = open("text.txt", "r")
l = list(map(lambda x: x.strip(), f.readlines()))
df = df.set_index("Serial").filter(items=l, axis=0).reset_index()
df.to_csv("new_df.csv", mode="a", index=False)
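If exact matches are enough (the question's str.contains also matches substrings, which is one way extra rows can appear), an isin mask is a compact alternative. A minimal sketch, assuming df is already loaded and the column is named Serial as in the answer:

with open("text.txt") as f:
    serials = [line.strip() for line in f]

# boolean mask keeps only the rows whose Serial appears in the txt file
df_new = df[df["Serial"].astype(str).isin(serials)]
df_new.to_csv("new_df.csv", index=False)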


Better way to concatenate pandas matrices

I need to concatenate multiple matrices (containing numbers and strings) in a loop. So far I have written this solution, but I don't like using a dummy variable (h), and I'm sure the code could be improved.
Here it is:
h = 0
for name in list_of_matrices:
    h += 1
    Matrix = pd.read_csv(name)
    if h == 1:
        Matrix_final = Matrix
        continue
    Matrix_final = pd.concat([Matrix_final, Matrix])
For some reason, if I use the following code I end up with the two matrices one after the other rather than a single joined one (so this code does not fit):
li = []
for filename in all_files:
    df = pd.read_csv(filename, index_col=None, header=0)
    li.append(df)
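The second snippet only collects the frames into a list; a single concat at the end is still needed to join them. A minimal sketch of the usual idiom, assuming all_files holds the CSV paths:

import pandas as pd

li = []
for filename in all_files:
    li.append(pd.read_csv(filename, index_col=None, header=0))

# one concat at the end joins all frames, without the dummy counter
Matrix_final = pd.concat(li, ignore_index=True)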

Pandas - Append dataframe to new excel sheet for multiple files

I want my new output columns to land in a new sheet named "Analyzed Data" for multiple files. Each file has a different number of columns with varying names.
import os
import pandas as pd

path = r'C:\Users\Me\1Test'
filelist = []
for root, dirs, files in os.walk(path):
    for f in files:
        if not f.endswith('.txt'):
            continue
        filelist.append(os.path.join(root, f))

for f in filelist:
    df = pd.read_table(f)
    col = df.iloc[:, :-3]
    df['Average'] = col.mean(axis=1)
    out = (df.join(df.drop(df.columns[[-3, -1]], axis=1)
                     .sub(df[df.columns[-3]], axis=0)
                     .add_suffix(' - Background')))
    out.to_excel(f.replace('txt', 'xlsx'), 'Sheet1')
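If the goal is a sheet named "Analyzed Data" alongside the original data, one option is an ExcelWriter with two to_excel calls. A sketch, reusing f, df, and out from the loop above:

with pd.ExcelWriter(f.replace('txt', 'xlsx')) as writer:
    # original data on its own sheet
    df.to_excel(writer, sheet_name='Sheet1', index=False)
    # computed columns on the requested sheet
    out.to_excel(writer, sheet_name='Analyzed Data', index=False)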

Saving multiple dataframes in sheets and workbooks based on a condition

I am able to save multiple dataframes in multiple Excel sheets.
writer = pd.ExcelWriter('Cloud.xlsx', engine='xlsxwriter')
frames = {'Image': df1, 'Objects': df2, 'Text': df3, 'Labels': df4}
for sheet, frame in frames.items():
    frame.to_excel(writer, sheet_name=sheet)
writer.save()
Now I want to create multiple files based on the dataframes column. For example, I want to create 4 excel files:
df1:

Category  URL           Obj
A         example.com   Chair
A         example2.com  table
B         example3.com  glass
B         example4.com  tv
All my dataframes have 7 categories, and I want to create 7 files based on the category.
I think you need:
import os

frames = {'Image': df1, 'Objects': df2, 'Text': df3, 'Labels': df4}
for sheet, frame in frames.items():
    for cat, g in frame.groupby('Category'):
        if not os.path.isfile(f'{cat}.xlsx'):
            # the file does not exist yet: create it
            writer = pd.ExcelWriter(f'{cat}.xlsx')
        else:
            # the file exists: append a new sheet
            writer = pd.ExcelWriter(f'{cat}.xlsx', mode='a', engine='openpyxl')
        g.to_excel(writer, sheet_name=sheet)
        writer.save()
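As a side note, newer pandas versions (2.0+) removed ExcelWriter.save(); a context manager handles the closing instead. A sketch of the same loop under that assumption:

import os
import pandas as pd

for sheet, frame in frames.items():
    for cat, g in frame.groupby('Category'):
        exists = os.path.isfile(f'{cat}.xlsx')
        # append to an existing workbook, otherwise create it
        with pd.ExcelWriter(f'{cat}.xlsx',
                            mode='a' if exists else 'w',
                            engine='openpyxl') as writer:
            g.to_excel(writer, sheet_name=sheet)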

Dataframe index rows all 0's

I'm iterating through PDFs to obtain the text entered in their form fields. When I send the rows to a csv file, it only exports the last row. When I print results from the DataFrame, all the row indexes are 0's. I have tried various solutions from Stack Overflow, but I can't get anything to work: what should be 0, 1, 2, 3...etc. comes out as 0, 0, 0, 0...etc.
Here is what I get when printing results, only the last row exports to csv file:
            0
0  1938282828
            0
0  1938282828
           0
0   22222222
infile = glob.glob('./*.pdf')
for i in infile:
    if i.endswith('.pdf'):
        pdreader = PdfFileReader(open(i, 'rb'))
        diction = pdreader.getFormTextFields()
        myfieldvalue2 = str(diction['ID'])
        df = pd.DataFrame([myfieldvalue2])
        print(df)
Thank you for any help!
You are replacing the same dataframe each time:
infile = glob.glob('./*.pdf')
for i in infile:
    if i.endswith('.pdf'):
        pdreader = PdfFileReader(open(i, 'rb'))
        diction = pdreader.getFormTextFields()
        myfieldvalue2 = str(diction['ID'])
        df = pd.DataFrame([myfieldvalue2])  # this creates a new df each time
        print(df)
Correct Code:
infile = glob.glob('./*.pdf')
df = pd.DataFrame()
for i in infile:
    if i.endswith('.pdf'):
        pdreader = PdfFileReader(open(i, 'rb'))
        diction = pdreader.getFormTextFields()
        myfieldvalue2 = str(diction['ID'])
        df = df.append([myfieldvalue2])
print(df)
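Worth noting: DataFrame.append was deprecated and then removed in pandas 2.0, and even in the corrected code the row index stays 0, 0, 0 unless ignore_index=True is passed. A sketch that sidesteps both issues by collecting values in a plain list first (same PyPDF2 calls as above; the csv name is a placeholder):

import glob
import pandas as pd
from PyPDF2 import PdfFileReader

values = []
for i in glob.glob('./*.pdf'):
    pdreader = PdfFileReader(open(i, 'rb'))
    diction = pdreader.getFormTextFields()
    values.append(str(diction['ID']))

# building the frame once gives a clean 0, 1, 2, ... index
df = pd.DataFrame(values, columns=['ID'])
df.to_csv('output.csv', index=False)  # 'output.csv' is a placeholder name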

Vectorization of a loop in pandas

I've been trying to vectorize the following with no luck.
Consider two dataframes. One is a list of dates:
cols = ['col1', 'col2']
index = pd.date_range('1/1/15', '8/31/18')
df = pd.DataFrame(columns=cols)
What I'm doing currently is looping through df and, for each date, getting the count of all rows in my main (large) dataframe df_main whose n_date is less than or equal to the date in question:
for x in range(len(index)):
    temp_arr = []
    active = len(df_main[df_main.n_date <= index[x]])
    temp_arr = [index[x], active]
    df = df.append(pd.Series(temp_arr, index=cols), ignore_index=True)
Is there a way to vectorize the above?
What about something like the following:
# initializing
mycols = ['col1', 'col2']
myindex = pd.date_range('1/1/15', '8/31/18')
mydf = pd.DataFrame(columns=mycols)

# create df_main (each of myindex's dates minus 10 days)
df_main = pd.DataFrame(data=myindex - pd.Timedelta(days=10), columns=['n_date'])

# wrap a dataframe around a list comprehension
mydf = pd.DataFrame([[x, len(df_main[df_main['n_date'] <= x])] for x in myindex],
                    columns=mycols)
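The list comprehension still loops in Python. Because each value is just a count of dates less than or equal to x, sorting df_main's dates once and calling numpy's searchsorted computes all counts at once. A sketch, reusing the setup above:

import numpy as np
import pandas as pd

# searchsorted with side='right' returns, for each date in myindex,
# how many sorted dates are <= that date
dates = np.sort(df_main['n_date'].to_numpy())
active = np.searchsorted(dates, myindex.to_numpy(), side='right')

mydf = pd.DataFrame({'col1': myindex, 'col2': active})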