Loop through and concat spatial files with GeoPandas into one GeoDataFrame

I've been trying to batch process a series of .geojson files of demographic census tract data into one output GeoDataFrame. All the inputs have the same shape and share the same index/label and geometry columns. For example:
Table 1:

Geo             Businesses  Geometry
Census Tract 1  5           Polygon(x,y)
Census Tract 2  6           Polygon(x,y)

Table 2:

Geo             Employees  Geometry
Census Tract 1  100        Polygon(x,y)
Census Tract 2  2          Polygon(x,y)

Table 3:

Geo             Loans  Geometry
Census Tract 1  94     Polygon(x,y)
Census Tract 2  3      Polygon(x,y)
...and so on.
I expect to get an output like this:
Geo             Businesses  Employees  Loans  Geometry
Census Tract 1  5           100        94     Polygon(x,y)
Census Tract 2  6           2          3      Polygon(x,y)
I've had success doing this for .csv and .xlsx files, but I have not been able to do it successfully with GeoPandas. So far my workflow has been the following:
First, I point to the .geojson files I want in my directory:
# point to files in inputs_path directory so we can loop over them all
files = os.listdir(inputs_path)
files_mo_geojson = [f for f in files if f.endswith('MO.geojson')]
# view
print(files_mo_geojson)
['SBALoans_CensusTract_MO.geojson',
'Businesses_5_9_Employees_CensusTract_MO.geojson',
'Businesses_Under50_Employees_CensusTract_MO.geojson',
'Businesses_1_4_Employees_CensusTract_MO.geojson']
Then I make an empty GeoDataFrame:
gdf_mo = geopandas.GeoDataFrame()
and try various methods of getting the desired output:
# this results in the index being duplicated each time a new GeoDataFrame is appended
for file in files_mo_geojson:
    data = geopandas.read_file(inputs_path/file)
    gdf_mo = gdf_mo.append(data, ignore_index=True)
or
# this results in only the first GeoDataFrame being read in
for file in files_mo_geojson:
    data = geopandas.read_file(inputs_path/file)
    geopandas.GeoDataFrame(pandas.concat([gdf_mo, data], ignore_index=True))
or
# this results in all GeoDataFrames being read in but only the first one listed retaining values
for file in files_mo_geojson:
    data = geopandas.read_file(inputs_path/file)
    gdf_mo = gdf_mo.append(data, ignore_index=True).dissolve(by='label', aggfunc='sum')
or
# this results in only the first GeoDataFrame being read in
for file in files_mo_geojson:
    data = geopandas.read_file(inputs_path/file)
    gdf_mo = geopandas.GeoDataFrame(gdf_mo.merge(data, left_on='geometry', right_on='geometry'))
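For reference, a minimal sketch of one approach that would produce the wide output shown above, assuming every file shares the same label column (called 'label' below, as in the dissolve attempt) and identical geometries: keep the geometry from the first file only and merge the remaining frames on the shared label.

import functools
import geopandas

# read every matching file into its own GeoDataFrame
gdfs = [geopandas.read_file(inputs_path / f) for f in files_mo_geojson]

# drop the duplicate geometry column from every frame except the first,
# then merge them one by one on the shared label column
# ('label' is an assumption -- use whatever your shared index column is called)
merged = functools.reduce(
    lambda left, right: left.merge(right.drop(columns='geometry'), on='label'),
    gdfs,
)
gdf_mo = geopandas.GeoDataFrame(merged, geometry='geometry')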

Related

How to make dataframe from different parts of an Excel sheet given specific keywords?

I have one Excel file where multiple tables are placed in the same sheet. My requirement is to read certain tables based on a keyword. I have read the tables using the skiprows and nrows method, which works as of now, but in the future it won't work due to the dynamic table lengths.
Is there any other workaround apart from skiprows & nrows to read a table as shown in the picture?
I want to read data1 as one table and data2 as another table. In particular, I want columns "RR", "FF" & "WW" as two different data frames.
I'd appreciate it if someone could help or guide me on how to do this.
Method I have tried:
all_files=glob.glob(INPATH+"*sample*")
df1 = pd.read_excel(all_files[0],skiprows=11,nrows= 3)
df2 = pd.read_excel(all_files[0],skiprows=23,nrows= 3)
This works fine; the only problem is that the table length will vary every time.
With an Excel file identical to the one in your image, here is one way to do it:
import pandas as pd

df = pd.read_excel("file.xlsx").dropna(how="all").reset_index(drop=True)

# Setup
targets = ["Data1", "Data2"]
indices = [df.loc[df["Unnamed: 0"] == target].index.values[0] for target in targets]
dfs = []

for i in range(len(indices)):
    # Slice df starting from the first index to the second one
    try:
        data = df.loc[indices[i] : indices[i + 1] - 1, :]
    except IndexError:
        data = df.loc[indices[i] :, :]
    # For one slice, get only values where the row starts with 'rr'
    r_idx = data.loc[df["Unnamed: 0"] == "rr"].index.values[0]
    data = data.loc[r_idx:, :].reset_index(drop=True).dropna(how="all", axis=1)
    # Cleanup
    data.columns = data.iloc[0]
    data.columns.name = ""
    dfs.append(data.loc[1:, :].iloc[:, 0:3])
And so:
for item in dfs:
    print(item)
# Output
rr ff ww
1 car1 1000000 sellout
2 car2 1500000 to be sold
3 car3 1300000 sellout
rr ff ww
1 car1 1000000 sellout
2 car2 1500000 to be sold
3 car3 1300000 sellout

Linestring end coordinates in a .csv file along with source and target id

I have a Digital Road Map dataset containing the coordinates of nodes interconnected through a road network, along with the node numbers.
Dataset in xlsx
The dataset has three columns: column 1 - source, column 2 - target, and column 3 - geometry. geometry is a LINESTRING of a road with a start point coordinate, an end point coordinate, and a few intermediate point coordinates. The source and target columns are the node numbers of the starting node and end node of each road segment.
I want to extract only the coordinates of the starting node and end node from each row, then arrange the filtered dataset so that each source and each target has its respective coordinates beside it.
The sample output file is
desired sample output
I am looking for code using shapely; most of the info I can find covers a single LINESTRING. Since my data has more than a million rows, I have not been able to find relevant code that iterates through the entire dataset.
Your sample data is unusable as it is an image, so I have simulated some. The approach:
- pick the first and last point from each LINESTRING
- structure them as columns (in df)
- reshape df as df2 into your desired structure
import io
import shapely.geometry, shapely.wkt
import pandas as pd
import numpy as np
# sample data...
df = pd.read_csv(
    io.StringIO(
        '''source,target,geometry
0,100,"LINESTRING (5.897759230176348 49.44266714130711, 6.242751092156993 49.90222565367873, 5.674051954784829 49.5294835475575)"
1,101,"LINESTRING (13.59594567226444 48.87717194273715, 12.51844038254671 54.470370591848, 6.658229607783568 49.20195831969157)"
2,102,"LINESTRING (16.71947594571444 50.21574656839354, 23.42650841644439 50.30850576435745, 22.77641889821263 49.02739533140962, 14.60709842291953 51.74518809671997)"
3,103,"LINESTRING (18.62085859546164 54.68260569927078, 23.79919884613338 52.69109935160657, 20.89224450041863 54.31252492941253)"
4,104,"LINESTRING (5.606975945670001 51.03729848896978, 6.589396599970826 51.85202912048339, 3.31501148496416 51.34577662473805, 5.988658074577813 51.85161570902505)"
5,105,"LINESTRING (4.799221632515724 49.98537303323637, 6.043073357781111 50.12805166279423, 3.31501148496416 51.34577662473805, 6.15665815595878 50.80372101501058, 3.314971144228537 51.34575511331991)"
6,106,"LINESTRING (3.31501148496416 51.34577662473805, 3.830288527043137 51.62054454203195, 6.905139601274129 53.48216217713065, 4.705997348661185 53.09179840759776)"
7,107,"LINESTRING (7.092053256873896 53.14404328064489, 3.830288527043137 51.62054454203195, 6.842869500362383 52.22844025329755, 3.31501148496416 51.34577662473805)"
8,108,"LINESTRING (6.589396599970826 51.85202912048339, 6.905139601274129 53.48216217713065, 3.314971144228537 51.34575511331991, 5.988658074577813 51.85161570902505)"
9,109,"LINESTRING (5.606975945670001 51.03729848896978, 4.286022983425084 49.90749664977255)"'''
    )
)
# pick first and last point from each linestring as columns
df = df.join(
    df["geometry"]
    .apply(lambda ls: np.array(shapely.wkt.loads(ls).coords)[[0, -1]])
    .apply(
        lambda x: {
            f"{c}_point": shapely.geometry.Point(x[i])
            for i, c in enumerate(df.columns)
            if c != "geometry"
        }
    )
    .apply(pd.Series)
)
# reshape to row wise
df2 = pd.melt(
    df,
    id_vars=["source", "target"],
    value_vars=["source_point", "target_point"],
    value_name="point",
)
df2["node_number"] = np.where(
    df2["variable"] == "source_point", df2["source"], df2["target"]
)
df2 = df2.drop(columns=["source", "target", "variable"])
output
point                                         node_number
POINT (5.897759230176348 49.44266714130711)   0
POINT (13.59594567226444 48.87717194273715)   1
POINT (16.71947594571444 50.21574656839354)   2
POINT (18.62085859546164 54.68260569927078)   3
POINT (5.606975945670001 51.03729848896978)   4
POINT (4.799221632515724 49.98537303323637)   5
POINT (3.31501148496416 51.34577662473805)    6
POINT (7.092053256873896 53.14404328064489)   7
POINT (6.589396599970826 51.85202912048339)   8
POINT (5.606975945670001 51.03729848896978)   9
POINT (5.674051954784829 49.5294835475575)    100
POINT (6.658229607783568 49.20195831969157)   101
POINT (14.60709842291953 51.74518809671997)   102
POINT (20.89224450041863 54.31252492941253)   103
POINT (5.988658074577813 51.85161570902505)   104
POINT (3.314971144228537 51.34575511331991)   105
POINT (4.705997348661185 53.09179840759776)   106
POINT (3.31501148496416 51.34577662473805)    107
POINT (5.988658074577813 51.85161570902505)   108
POINT (4.286022983425084 49.90749664977255)   109
Do you mean:
df[['Start', 'end']] = df['geometry'].str.split(',', expand=True)

Add (insert) new columns to an existing file with different shape Pandas

I have 5 dataframe files, each of which has a different shape in pandas.
1st file contains of 3968 rows x 7 columns (Date,Open,High,Low,Close,Adj Close,Volume)
2nd file contains of 3774 rows x 7 columns (Date1,Open1,High1,Low1,Close1,Adj Close1,Volume1)
3rd file contains of 58 rows x 3 columns (No, Date, Rate)
4th file contains of 192 rows x 3 columns (No1, Date1, Rates1)
5th file contains of 1850 rows x 3 columns (No2, Date2,Rate2)
My output should be:
3968 rows x 16 columns
(Date,Open,High,Low,Close,Adj Close,Volume, Open1,High1,Low1,Close1,Adj Close1,Volume1, Rate, Rates1, Rates2)
How do I append/insert the new columns onto the 1st file from the 2nd - 5th files with different shapes?
Is there any technique to match the different shapes?
Here is my code:
df = pd.read_csv('^JKLQ45.csv') # 1st file
files = [file for file in os.listdir('./Raw Data')] # i put all the files
all_data = pd.DataFrame()
for file in files:
    current_data = pd.read_csv('./Raw Data'+"/"+file)
    all_data = pd.concat([all_data, current_data])
all_data.to_csv("all_data_copy.csv", index=False)
The output is 9842 rows × 14 columns, but I want the shape to be 3968 rows x 16 columns.
Can you add this code inside the loop?
pd.concat([df1.reset_index(drop=True),df2.reset_index(drop=True)],axis=1)
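For illustration, a rough sketch of how that suggestion might fit into the original loop (an assumption about the intent: each file's columns are placed side by side and rows are matched purely by position; any column renaming is up to you):

import os
import pandas as pd

all_data = pd.read_csv('^JKLQ45.csv')  # start from the 1st file

for file in os.listdir('./Raw Data'):
    current_data = pd.read_csv('./Raw Data' + "/" + file)
    # axis=1 puts the new columns beside the existing ones instead of stacking rows;
    # reset_index(drop=True) aligns the rows by position
    all_data = pd.concat(
        [all_data.reset_index(drop=True), current_data.reset_index(drop=True)],
        axis=1,
    )

all_data.to_csv("all_data_copy.csv", index=False)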

map vectorised terms to the original dataframe

I have a dataframe column that contains domain names, e.g. newyorktimes.com. I split on '.' and apply CountVectorizer to "newyorktimes".
The dataframe
domain             split          country
newyorktimes.com   newyorktimes   usa
newyorkreport.com  newyorkreport  usa
"newyorktimes" is also added as a new dataframe column called 'split'
I'm able to get the term frequencies
vectoriser = CountVectorizer(analyzer='word', ngram_range=(2, 2), stop_words='english')
X = vectoriser.fit_transform(df['split'])
features = vectoriser.get_feature_names()
count = X.toarray().sum(axis=0)
dic = dict(zip(features, count))
dic = sorted(dic.items(), key=lambda x: x[1], reverse=True)
But I also need the 'country' information from the original dataframe and I don't know how to map the terms back to the original dataframe.
Expected output
term         country  domain count
new york     usa      2
york times   usa      1
york report  usa      1
I cannot reproduce the example you provided; I'm not sure you provided the correct input for the CountVectorizer. If it is a matter of adding the count matrix back to the data frame, you can do it like this:
df = pd.DataFrame({'corpus': ['This is the first document.',
                              'This document is the second document.',
                              'And this is the third one.',
                              'Is this the first document?']
                   })
vectoriser = CountVectorizer(analyzer='word', ngram_range=(2, 2), stop_words='english')
X = vectoriser.fit_transform(df['corpus'])
features = vectoriser.get_feature_names()
pd.concat([df, pd.DataFrame(X.toarray(), columns=features, index=df.index)], axis=1)
                                   corpus  document second  second document
0            This is the first document.                0                0
1  This document is the second document.                1                1
2             And this is the third one.                0                0
3            Is this the first document?                0                0
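If the goal is the expected output in the question (term, country, count), here is a possible follow-up sketch, under the assumption that the 'split' column actually holds space-separated words (e.g. 'new york times'), since CountVectorizer needs word tokens to build bigrams:

import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

# assumed input: 'split' contains space-separated words, not 'newyorktimes'
df = pd.DataFrame({
    'domain': ['newyorktimes.com', 'newyorkreport.com'],
    'split': ['new york times', 'new york report'],
    'country': ['usa', 'usa'],
})

vectoriser = CountVectorizer(analyzer='word', ngram_range=(2, 2))
X = vectoriser.fit_transform(df['split'])
features = vectoriser.get_feature_names_out()  # get_feature_names() on older sklearn

# attach the counts to the original rows, then melt to long form so that
# the 'country' column travels with each term
counts = pd.DataFrame(X.toarray(), columns=features, index=df.index)
long = pd.concat([df[['country']], counts], axis=1).melt(
    id_vars='country', var_name='term', value_name='count'
)
long = long[long['count'] > 0]
print(long.groupby(['term', 'country'], as_index=False)['count'].sum())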

combine csv files, sort them by time and average the columns

I have many datasets in csv files; they look like the picture I attached.
The first column is always the time in minutes, but the time steps and the total number of rows differ between the raw data files. I'd like to have one output file (csv file) in which all the raw files are combined and sorted by time, so that the time increases from the top to the bottom of the column.
raw data and output
The concentration column should be averaged when more than one value exists for the same time.
I tried it like this:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
d1 = pd.read_csv('dat1.txt', sep="\t")
d2 = pd.read_csv('dat2.txt', sep="\t")
d1.columns
d2.columns
merged_outer = pd.merge(d1,d2, on='time', how='outer')
print(merged_outer)
but it doesn't lead to the correct output. I'm a beginner in Pandas, but I hope I explained the problem well enough. Thank you for any idea or suggestion!
Thank you for your idea. Unfortunately, when I run it I get an error message saying that dat1.txt doesn't exist. This seems strange to me as I read the raw files initially by:
d1 = pd.read_csv('dat1.txt', sep="\t")
d2 = pd.read_csv('dat2.txt', sep="\t")
Sorry, here is the data as raw text:
raw data 1
time column2 column3 concentration
1 2 4 3
2 2 4 6
4 2 4 2
7 2 4 5
raw data 2
time column2 column3 concentration
1 2 4 6
2 2 4 2
8 2 4 9
10 2 4 5
12 2 4 7
Something like this might work:
filenames = ['dat1.txt', 'dat2.txt',...]
dataframes = {filename: pd.read_csv(filename, sep="\t") for filename in filenames}
merged_outer = pd.concat(dataframes).groupby('time').mean()
When you pass a dict to pd.concat, it creates a MultiIndex DataFrame with the dict keys as level 0.
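For illustration, a small sketch applying that to the two raw data samples above (the io.StringIO frames stand in for the real files, and the output filename is only an example):

import io
import pandas as pd

dat1 = """time\tcolumn2\tcolumn3\tconcentration
1\t2\t4\t3
2\t2\t4\t6
4\t2\t4\t2
7\t2\t4\t5"""

dat2 = """time\tcolumn2\tcolumn3\tconcentration
1\t2\t4\t6
2\t2\t4\t2
8\t2\t4\t9
10\t2\t4\t5
12\t2\t4\t7"""

# stand-ins for pd.read_csv('dat1.txt', sep="\t") etc.
dataframes = {
    'dat1.txt': pd.read_csv(io.StringIO(dat1), sep="\t"),
    'dat2.txt': pd.read_csv(io.StringIO(dat2), sep="\t"),
}

# the dict keys become level 0 of the row MultiIndex; grouping by 'time' then
# averages every column wherever a time appears in more than one file
merged_outer = pd.concat(dataframes).groupby('time').mean()
print(merged_outer)  # e.g. time 1 -> concentration (3 + 6) / 2 = 4.5
merged_outer.to_csv('combined_sorted.csv')  # example output filename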