Use pandas df.concat to replace .append with custom index - pandas

I'm currently trying to replace .append in my code since it won't be supported in the future and I have some trouble with the custom index I'm using
I read the names of every .shp files in a directory and extract some date from it
To make the link with an excel file I have, I use the name I extract from the title of the file
df = pd.DataFrame(columns = ['date','fichier'])
for i in glob.glob("*.shp"):
nom_parcelle = i.split("_")[2]
if not nom_parcelle in df.index:
# print(df.last_valid_index())
date_recolte = i.split("_")[-1]
new_row = pd.Series(data={'date':date_recolte.split(".")[0], 'fichier':i}, name = nom_parcelle)
df = df.append(new_row, ignore_index=False)
This works exactly as I want it to be
Sadly, I can't find a way to replace it with .concat
I looked for ways to keep the index whith concat but didn't find anything that worked as I intended
Did I miss anything?

Try the approach below with pandas.concat based on your code :
import glob
import pandas as pd
​
df = pd.DataFrame(columns = ['date','fichier'])
dico_dfs={}
​
for i in glob.glob("*.shp"):
nom_parcelle = i.split("_")[2]
if not nom_parcelle in df.index:
# print(df.last_valid_index())
date_recolte = i.split("_")[-1]
new_row = pd.Series(data={'date':date_recolte.split(".")[0], 'fichier':i}, name = nom_parcelle)
dico_dfs[i]= new_row.to_frame()
df= pd.concat(dico_dfs, ignore_index=False, axis=1).T.droplevel(0)
# Output :
print(df)
date fichier
nom1 20220101 a_xx_nom1_20220101.shp
nom2 20220102 b_yy_nom2_20220102.shp
nom3 20220103 c_zz_nom3_20220103.shp

Related

Copy/assign a Pandas dataframe based on their name in a for loop

I am relatively new with python - and I am struggling to do the following:
I have a set of different data frames, with sequential naming (df_i), which I want to access in a for loop based on their name (with an string), how can I do that? e.g.
df_1 = pd.read_csv('...')
df_2 = pd.read_csv('...')
df_3 = pd.read_csv('...')
....
n_df = 3
for i in range(len(n_df)):
df_namestr= 'df_' + str(i+1)
# ---------------------
df_temp = df_namestr
# ---------------------
# Operate with df_temp. For i+1= 1, df_temp should be df_1
Kind regards,
DF
You can try something like that:
for n in range(1, n_df+1):
df_namestr = f"df_{n}"
df_tmp = locals().get(df_namestr)
if not isinstance(df_tmp, pd.DataFrame):
continue
print(df_namestr)
print(df_tmp)
Refer to the documentation of locals() to know more.
Would it be better to approach the accessing of multiple dataframes by reading them into a list?
You could put all the csv files required in a subfolder and read them all in. Then they are in a list and you can access each one as an item in that list.
Example:
import pandas as pd
import glob
path = r'/Users/myUsername/Documents/subFolder'
csv_files = glob.glob(path + "/*.csv")
dfs = []
for filename in csv_files:
df = pd.read_csv(filename)
dfs.append(df)
print(len(dfs))
print(dfs[1].head())

Pandas saving in text format

I am trying to save the output, which is a number ,to a text format in pandas after working on the dataset.
import pandas as pd
df = pd.read_csv("sales.csv")
def HighestSales():
df.drop(['index', "month"], axis =1, inplace = True)
df2 = df.groupby("year").sum()
df2 = df2.sort_values(by = 'sales', ascending = True).reset_index()
df3 = df2.loc[11, 'year']
df4 = pd.Series(df3)
df5 = df4.iloc[0]
#*the output here is 1964 , which alone needs to be saved in the text file*.
df5.to_csv("modified.txt")
HighestSales()
But I get 'numpy.int64' object has no attribute 'to_csv'- this error . Is there a way to save just one single value in the text file?
you can do:
# open a file named modified.txt
with open('modified.txt', 'w') as f:
# df5 is just an integer of 196
# and write 1964 plus a line break
f.write(df5 + '\n')
You cannot save a single value to csv by using "pd.to_csv". In your case you should convert it into DataFrame again and then saving it. If you want to see only the number in .txt file, you need to add some parameters:
result = pd.DataFrame(df5)
result.to_csv('modified.txt', index=False, header=False)

Why my code is giving me data in 1 column it should give me in two different column

i need to know what is happening in my code? it should give data in separate columns it is giving me same data in a oath columns.
i tried to change the value of row variable but it didn't found the reason
import requests
import csv
from bs4 import BeautifulSoup
import pandas as pd
import time
arrayofRequest= []
prices=[]
location=[]
columns=['Price', 'Location']
df = pd.DataFrame(columns=columns)
for i in range(0,50):
arrayofRequest.append("https://www.zameen.com/Homes/Karachi-2-"+str(i+1)+".html?gclid=Cj0KCQjw3JXtBRC8ARIsAEBHg4mj4jX1zZUt3WzGScjH6nfwzrEqkuILarcmg372imSneelSXPj0fGIaArNeEALw_wcB")
request = requests.get(arrayofRequest[i])
soupobj= BeautifulSoup(request.content,"lxml")
# print(soupobj.prettify())
links =soupobj.find_all('span',{'class':'f343d9ce'})
addresses =soupobj.find_all('div',{'class':'_162e6469'})
price = ""
for i in range(0,len(links)):
price = str(links[i]).split(">")
price = price[len(price)-2].split("<")[0]
prices.append(price)
address = str(addresses[i]).split(">")
address = address[len(address)-2].split("<")[0]
location.append(address)
row=location[i]+","+prices[i]
df = df.append(pd.Series(row, index=columns), ignore_index=False)
# filewriter = csv.writer(csvfile, delimiter=',',filewriter.writerow(['Price', 'Location']),filewriter.writerow([prices[0],location[0]])
df.to_csv('DATA.csv', index=False)
because of this:
pd.Series(row, index=columns)
try smthg like
pd.DataFrame([[locations[i], prices[i]]], index=columns))
However this could be done only once outside of your for loop
pd.DataFrame(list(zip(locations, prices)), index=columns))

Pandas - Trying to save a set of files by reading it using Pandas but only the latest file gets saved

I am trying to read a set of txt files into Pandas as below. I see I am able to read them to a Dataframe however when I try to save the Dataframe it only saves the last file it read. However when I perform print(df) it prints all the records.
Given below is the code I am using:
files = '/users/user/files'
list = []
for file in files:
df = pd.read_csv(file)
list.append(df)
print(df)
df.to_csv('file_saved_path')
Could anyone advice why is the last file only being saved to the csv file and now the entire list.
Expected output:
output1
output2
output3
Current output:
output1,output2,output3
Try this:
path = '/users/user/files'
for id in range(len(os.listdir(path))):
file = os.listdir(path)[id]
data = pd.read_csv(path+'/'+file, sep='\t')
if id == 0:
df1 = data
else:
data = pd.concat([df1, data], ignore_index=True)
data.to_csv('file_saved_path')
First change variable name list, because code word in python (builtin), then for final DataFrame use concat:
files = '/users/user/files'
L = []
for file in files:
df = pd.read_csv(file)
L.append(df)
bigdf = pd.concat(L, ignore_index=True)
bigdf.to_csv('file_saved_path')

Use pandas to read the csv file with several uncertain factors

I have asked the related question of string in: Find the number of \n before a given word in a long string. But this method cannot solve the complicate case I happened to. Thus I want to find out a solution of Pandas here.
I have a csv file (I just represent as a string):
csvfile = 'Idnum\tId\nkey:maturity\n2\nmaturity\tpara1\tpara2\n1Y\t0\t0\n2Y\t0\t0'
I want to use the pandas:
value = pandas.read_csv(csvfile, sep = '\t', skiprows = 3).set_index('maturity')
to obtain the table like:
and set the first columan maturity as index.
But there are several uncertain factors in the csvfile:
1..set_index('maturity'), the key maturity
of index is included in the row key: maturity. Then I should find the row key: xxxx and obtain the string xxxx
2.skiprows = 3: the number of skipped rows before the title:
is uncertain. The csvfile can be something like:
'Idnum\tId\nkey:maturity\n2\n\n\n\n\n\nmaturity\tpara1\tpara2\n1Y\t0\t0\n2Y\t0\t0'
I should find the row number of title (namely the row beginning with xxxx found in the rowkey: xxxx).
3.sep = '\t': the csvfile may use space as separator like:
csvfile = 'Idnum Id\nkey: maturity\n2\nmaturity para1 para2\n1Y 0 0\n2Y 0 0'
So is there any general code of pandas to deal with the csvfile with above uncertain factors?
Actually the string:
csvfile = 'Idnum\tId\nkey:maturity\n2\nmaturity\tpara1\tpara2\n1Y\t0\t0\n2Y\t0\t0'
is from a StringIO: data
data.getvalue() = 'Idnum\tId\nkey:maturity\n2\nmaturity\tpara1\tpara2\n1Y\t0\t0\n2Y\t0\t0'
I am not familiar with this structure and even I want to obtain a table of original data without any edition by using:
value = pandas.read_csv(data, sep = '\t')
There will be a error.
You can read the file line by line, collecting the necessary information and then pass the remainder to pd.read_csv with the appropriate arguments:
from io import StringIO
import re
import pandas as pd
with open('data.csv') as fh:
key = next(filter(lambda x: x.startswith('key:'), fh)).lstrip('key:').strip()
header = re.split('[ \t]+', next(filter(lambda x: x.startswith(key), fh)).strip())
df = pd.read_csv(StringIO(fh.read()), header=None, names=header, index_col=0, sep=r'\s+')
Example for data via StringIO:
fh = StringIO('Idnum\tId\nkey:maturity\n2\nmaturity\tpara1\tpara2\n1Y\t0\t0\n2Y\t0\t0')
key = next(filter(lambda x: x.startswith('key:'), fh)).lstrip('key:').strip()
header = re.split('[ \t]+', next(filter(lambda x: x.startswith(key), fh)).strip())
df = pd.read_csv(fh, header=None, names=header, index_col=0, sep=r'\s+')
If you do not mind reading the csv file twice you can try doing something like:
from io import StringIO
csvfile = 'Idnum\tId\nkey:maturity\n2\nmaturity\tpara1\tpara2\n1Y\t0\t0\n2Y\t0\t0'
data = pd.read_csv(StringIO(csvfile), sep='\t', error_bad_lines=False, header=None)
skiprows = len(data)
pd.read_csv(StringIO(csvfile), sep='\t', skiprows=skiprows)
same for you other example:
csvfile = 'Idnum\tId\nkey:maturity\n2\n\n\n\n\n\nmaturity\tpara1\tpara2\n1Y\t0\t0\n2Y\t0\t0'
data = pd.read_csv(StringIO(csvfile), sep='\t', error_bad_lines=False, header=None)
skiprows = len(data)
pd.read_csv(StringIO(csvfile), sep='\t', skiprows=skiprows)
This assumes that you know the sep of the file
Also if you want to find the key:
csvfile = 'Idnum\tId\nkey:maturity\n2\n\n\n\n\n\nmaturity\tpara1\tpara2\n1Y\t0\t0\n2Y\t0\t0'
data = pd.read_csv(StringIO(csvfile), sep='\t', error_bad_lines=False, header=None)
key = [x.replace('key:','') for x in data[0] if x.find('key')>-1]
skiprows = len(data)
pd.read_csv(StringIO(csvfile), sep='\t', skiprows=skiprows).set_index(key)