Appending GeoDataFrames does not return expected dataframe - pandas

I have the following issue when trying to append dataframes containing geometry types. The pandas dataframe I am looking at looks like this:
name x_zone y_zone
0 A1 65.422080 48.147850
1 A1 46.635708 51.165745
2 A1 46.597984 47.657444
3 A1 68.477700 44.073700
4 A3 46.635708 54.108190
5 A3 46.635708 51.844770
6 A3 63.309560 48.826878
7 A3 62.215572 54.108190
As you can see, there are four rows per name, as these represent the corners of polygons. I need this to be in the form of a polygon as defined in geopandas, i.e. I need a GeoDataFrame. To do so, I use the following code for just one of the names (just to check it works):
import geopandas as gpd
from shapely.geometry import Polygon

name = 'A1'
df = df[df['name']==name]
x = df['x_zone'].to_list()
y = df['y_zone'].to_list()
polygon_geom = Polygon(zip(x, y))
crs = {'init': "EPSG:4326"}
polygon = gpd.GeoDataFrame(index=[name], crs=crs, geometry=[polygon_geom])
print(polygon)
which returns:
geometry
A1 POLYGON ((65.42208 48.14785, 46.63571 51.16575...
polygon.info()
<class 'geopandas.geodataframe.GeoDataFrame'>
Index: 1 entries, A1 to A1
Data columns (total 1 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 geometry 1 non-null geometry
dtypes: geometry(1)
memory usage: 16.0+ bytes
So far, so good. So, for more names I thought the following would work:
unique_place = list(df['name'].unique())
GE = []
for name in unique_place:
    f = df[df['name']==name]
    x = f['x_zone'].to_list()
    y = f['y_zone'].to_list()
    polygon_geom = Polygon(zip(x, y))
    crs = {'init': "EPSG:4326"}
    polygon = gpd.GeoDataFrame(index=[name], crs=crs, geometry=[polygon_geom])
    print(polygon.info())
    GE.append(polygon)
But it returns a list, not a dataframe.
[ geometry
A1 POLYGON ((65.42208 48.14785, 46.63571 51.16575...,
geometry
A3 POLYGON ((46.63571 54.10819, 46.63571 51.84477...]
This is strange, because .append() works very well when what is to be appended is a pandas dataframe.
What am I missing? Also, even in the first case I am left with only the geometry column, but that is not an issue because I can write the file to a shapefile (.shp) and read it again to get a second column (name).
Grateful for any solution that'll get me going!

I guess you need example code using groupby on your data. Let me know if that is not the case.
from io import StringIO
import geopandas as gpd
import pandas as pd
from shapely.geometry import Polygon
import numpy as np
dats_str = """index id x_zone y_zone
0 A1 65.422080 48.147850
1 A1 46.635708 51.165745
2 A1 46.597984 47.657444
3 A1 68.477700 44.073700
4 A3 46.635708 54.108190
5 A3 46.635708 51.844770
6 A3 63.309560 48.826878
7 A3 62.215572 54.108190"""
# read the string, convert to dataframe
df1 = pd.read_csv(StringIO(dats_str), sep=r'\s+', index_col='index')
# Use groupby as an iterator to:
# - collect the items of interest
# - process some data: mean, create Polygon, maybe others
# - all results are collected/appended into lists
ids = []
counts = []
meanx = []
meany = []
list_x = []
list_y = []
polygon = []
for label, group in df1.groupby('id'):
    # label: 'A1', 'A3'
    # group: the sub-dataframe for that id
    ids.append(label)
    counts.append(len(group))  # number of rows
    meanx.append(group.x_zone.mean())
    meany.append(group.y_zone.mean())
    # process x, y data of this group -> for polygon
    xs = group.x_zone.values
    ys = group.y_zone.values
    list_x.append(xs)
    list_y.append(ys)
    polygon.append(Polygon(zip(xs, ys)))  # make/collect polygon
# items above are used to create a dataframe here
df_from_groupby = pd.DataFrame({'id': ids, 'counts': counts,
                                'meanx': meanx, 'meany': meany,
                                'list_x': list_x, 'list_y': list_y,
                                'polygon': polygon})
If you print the dataframe df_from_groupby, you will get:
id counts meanx meany \
0 A1 4 56.783368 47.761185
1 A3 4 54.699137 52.222007
list_x \
0 [65.42208, 46.635708, 46.597984, 68.4777]
1 [46.635708, 46.635708, 63.30956, 62.215572]
list_y \
0 [48.14785, 51.165745, 47.657444, 44.0737]
1 [54.10819, 51.84477, 48.826878, 54.10819]
polygon
0 POLYGON ((65.42207999999999 48.14785, 46.63570...
1 POLYGON ((46.635708 54.10819, 46.635708 51.844...
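As a closing note: the 'polygon' column above is a plain object column holding shapely Polygons; in recent geopandas versions it can be promoted to a proper geometry column. And the GE in your original loop is a plain Python list (GE.append is list.append, not DataFrame.append), so the one-row GeoDataFrames it holds need to be concatenated. A minimal sketch of both, assuming the variables from the code above (the crs value is an assumption):
# promote the groupby result to a GeoDataFrame using the 'polygon' column as geometry
gdf = gpd.GeoDataFrame(df_from_groupby, geometry='polygon', crs="EPSG:4326")
# and, for the loop in the question: combine the list of one-row GeoDataFrames into one frame
# combined = pd.concat(GE)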

Related

Dataframe to multiIndex for sktime format

I have multivariate time series data in this format (a pd.DataFrame with index on Time).
I am trying to use sktime, which requires the data to be in a multi-index format. I want to use a rolling window of 3 on the above data, and the result needs to be in this format: a pd.DataFrame with a MultiIndex on (instance, time).
I was wondering if it is possible to transform it to the new format.
Edit: here's a more straightforward and probably faster solution using row indexing:
df = pd.DataFrame({
'time':range(5),
'a':[f'a{i}' for i in range(5)],
'b':[f'b{i}' for i in range(5)],
})
w = 3
w_starts = range(0,len(df)-(w-1)) #start positions of each window
#iterate through the overlapping windows to create 'instance' col and concat
roll_df = pd.concat(
df[s:s+w].assign(instance=i) for (i,s) in enumerate(w_starts)
).set_index(['instance','time'])
print(roll_df)
Output
a b
instance time
0 0 a0 b0
1 a1 b1
2 a2 b2
1 1 a1 b1
2 a2 b2
3 a3 b3
2 2 a2 b2
3 a3 b3
4 a4 b4
Here's one way to achieve the desired result (this assumes the original data has columns named Time, A and B, as in the question):
import numpy as np

# Create the instance column
instance = np.repeat(range(len(df) - 2), 3)
# Repeat the Time column for each value in A and B
time = np.concatenate([df.Time[i:i+3].values for i in range(len(df) - 2)])
# Repeat the A column for each value in the rolling window
a = np.concatenate([df.A[i:i+3].values for i in range(len(df) - 2)])
# Repeat the B column for each value in the rolling window
b = np.concatenate([df.B[i:i+3].values for i in range(len(df) - 2)])
# Create a new DataFrame with the desired format
new_df = pd.DataFrame({'Instance': instance, 'Time': time, 'A': a, 'B': b})
# Set the MultiIndex on the new DataFrame
new_df.set_index(['Instance', 'Time'], inplace=True)
new_df

Merge value_counts of different pandas dataframes

I have a list of pandas dataframes, for each of which I take the value_counts of a column and finally append all the results to another dataframe.
df_AB = pd.read_pickle('df_AB.pkl')
df_AC = pd.read_pickle('df_AC.pkl')
df_AD = pd.read_pickle('df_AD.pkl')
df_AE = pd.read_pickle('df_AE.pkl')
df_AF = pd.read_pickle('df_AF.pkl')
df_AG = pd.read_pickle('df_AG.pkl')
The format of the above dataframes is as below (Example: df_AB):
df_AB:
id is_valid
121 True
122 False
123 True
For every pandas dataframe, I would need to get the value_counts of the is_valid column and store the results in df_result. I tried the code below but it doesn't seem to work as expected.
df_AB_VC = df_AB['is_valid'].value_counts()
df_AB_VC['group'] = "AB"
df_AC_VC = df_AC['is_valid'].value_counts()
df_AC_VC['group'] = "AC"
Result dataframe (df_result):
Group is_valid_True_Count is_Valid_False_Count
AB 2 1
AC
AD
.
.
.
Any leads would be appreciated
I think you just need to work on the dataframes a bit more systematically:
groups = ['AB', 'AC', 'AD',...]
out = pd.DataFrame({
    g: pd.read_pickle(f'df_{g}.pkl')['is_valid'].value_counts()
    for g in groups
}).T
Do not use separate variables for each dataframe; that makes your code much more complicated. Use a container:
files = ['df_AB.pkl', 'df_AC.pkl', 'df_AD.pkl', 'df_AE.pkl', 'df_AF.pkl']
# using the XX part in "df_XX.pkl", you need to adapt to your real use-case
dataframes = {f[3:5]: pd.read_pickle(f) for f in files}
# compute counts
counts = (pd.DataFrame({k: d['is_valid'].value_counts()
for k,d in dataframes.items()})
.T.add_prefix('is_valid_').add_suffix('_Count')
)
example output:
is_valid_True_Count is_valid_False_Count
AB 2 1
AC 2 1
Use pathlib to extract the group name, then collect the data into a dictionary before concatenating all entries:
import pandas as pd
import pathlib
data = {}
for pkl in pathlib.Path().glob('df_*.pkl'):
    group = pkl.stem.split('_')[1]
    df = pd.read_pickle(pkl)
    data[group] = (df['is_valid'].value_counts()
                   .add_prefix('is_valid_')
                   .add_suffix('_Count'))
df = pd.concat(data, axis=1).T
>>> df
is_valid_True_Count is_valid_False_Count
AD 2 1
AB 4 2
AC 0 3

saving dataframe groupby rows to exactly two lines

I have a dataframe and I want to group the rows based on a specific column. The number of rows in each group will be at least 4 and at most 50. I want to save one column from each group into two lines. If the group size is even, say 2n, then n rows go in one line and the remaining n in the second line. If it is odd, n+1 and n (or n and n+1) will do.
For example,
import pandas as pd
from io import StringIO
data = """
id,name
1,A
1,B
1,C
1,D
2,E
2,F
2,ds
2,G
2, dsds
"""
df = pd.read_csv(StringIO(data))
I want to group by id
df.groupby('id',sort=False)
and then get a dataframe like
id name
0 1 A B
1 1 C D
2 2 E F ds
3 2 G dsds
Probably not the most efficient solution, but it works:
import numpy as np
df = df.sort_values('id')
# next 3 lines: for each group find the separation
df['range_idx'] = range(0, df.shape[0])
df['mean_rank_group'] = df.groupby(['id'])['range_idx'].transform(np.mean)
df['separate_column'] = df['range_idx'] < df['mean_rank_group']
# groupby itself with the help of additional column
df.groupby(['id', 'separate_column'], as_index=False)['name'].agg(','.join).drop(
    columns='separate_column')
This is a bit of a convoluted approach, but it does the job:
def func(s: pd.Series):
    mid = max(s.shape[0] // 2, 1)
    l1 = ' '.join(list(s[:mid]))
    l2 = ' '.join(list(s[mid:]))
    return [l1, l2]
df_new = df.groupby('id').agg(func)
df_new["name1"]= df_new["name"].apply(lambda x: x[0])
df_new["name2"]= df_new["name"].apply(lambda x: x[1])
df = (df_new.drop(labels="name", axis=1)
      .stack()
      .reset_index()
      .drop(labels=["level_1"], axis=1)
      .rename(columns={0: "name"})
      .set_index("id"))
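For reference, a minimal sketch of the same idea using np.array_split, which for odd group sizes puts the extra row in the first half (the question allows either). It starts from the original df read from StringIO in the question (before it is reassigned above); the strip() call and the rows/out names are my own additions for illustration:
import numpy as np

rows = []
for gid, grp in df.groupby('id', sort=False):
    names = grp['name'].str.strip().to_list()
    # two halves per group; join each half's names with a space
    for half in np.array_split(names, 2):
        rows.append({'id': gid, 'name': ' '.join(half)})
out = pd.DataFrame(rows)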

How to quickly normalise data in pandas dataframe?

I have a pandas dataframe as follows.
import pandas as pd
df = pd.DataFrame({
'A':[1,2,3],
'B':[100,300,500],
'C':list('abc')
})
print(df)
A B C
0 1 100 a
1 2 300 b
2 3 500 c
I want to normalise the entire dataframe. Since column C is not a numeric column, what I do is as follows (i.e. remove C first, normalise the data and add the column back).
df_new = df.drop('C', axis=1)
df_C = df[['C']]
from sklearn import preprocessing
x = df_new.values  # returns a numpy array
min_max_scaler = preprocessing.MinMaxScaler()
x_scaled = min_max_scaler.fit_transform(x)
df_new = pd.DataFrame(x_scaled)
df_new['C'] = df_C
However, I am sure that there is an easier way of doing this in pandas (i.e. given the column names that I do not need to normalise, do the normalisation directly).
I am happy to provide more details if needed.
Use DataFrame.select_dtypes to pick out the numeric columns, normalize them by subtracting the minimum and dividing by the range (max - min), and then assign back only the normalized columns:
import numpy as np

df1 = df.select_dtypes(np.number)
df[df1.columns] = (df1 - df1.min()) / (df1.max() - df1.min())
print(df)
A B C
0 0.0 0.0 a
1 0.5 0.5 b
2 1.0 1.0 c
In case you want to apply any other functions on the data frame, you can use df[columns] = df[columns].apply(func).
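For instance, a minimal sketch using apply for a different normalisation (z-score) on the numeric columns only; the num_cols name and the z-score choice are just for illustration:
# z-score normalisation, applied column by column to the numeric columns
num_cols = df.select_dtypes(np.number).columns
df[num_cols] = df[num_cols].apply(lambda s: (s - s.mean()) / s.std())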

selection with hierarchical index - getting subset of the dataframe

I have a dataframe which represents a matrix. It is indexed by row number and column number, something like this:
from pandas import DataFrame, MultiIndex

arrays = [[1,1,1,2,2,2,3,3,3],[1,2,3,1,2,3,1,2,3]]
tuples = list(zip(*arrays))
index = MultiIndex.from_tuples(tuples, names=['row', 'col'])
df = DataFrame([100,99,98,97,96,95,94,93,92], index, columns=['score'])
score
row col
1 1 100
2 99
3 98
2 1 97
2 96
3 95
3 1 94
2 93
3 92
Now I'm trying to figure out how to select only cols 1 and 3 of row 1, meaning some code that will return:
score
row col
1 1 100
3 98
Of course I'm not looking for code that explicitly selects 1 and 3, but rather the more general case, in which I will pass a list of level-0 indices and a list of level-1 indices, and get back the appropriate subset.
I've tried:
k1 = 1
k2 = [1,3]
df.ix[k1,k2]
which raises an error.
This does work:
df.ix[k1].ix[k2]
But only if k1 is a scalar. If k1=[1,3], the proper subset is not retrieved, because the returned dataframe is still indexed with the level-0 index.
It doesn't look like what the author intended. I see no reason why df.ix[k1,k2] (where k1 and k2 are scalars, vectors, or a mix) shouldn't work. Am I missing something?
how about reindex()?
df.reindex([1,2], level=0).reindex([1,3], level=1)
For a more general solution, here is a similar question I answered before:
How to index into a pandas multindex with ix
I copy the code here:
import numpy as np
def ms(df, *args):
    idx = df.index
    for i, values in enumerate(args):
        if values is not None:
            if np.isscalar(values):
                values = [values]
            idx = idx.reindex(values, level=i)[0]
    return df.ix[idx]
ms(df, [1,2], [1, 3])
But I think using unstack() on the matrix is better:
m = df.score.unstack()
m.loc[[1,2],[1,3]]
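As a side note for newer pandas versions, where .ix has been removed: the same selection can be done directly on the MultiIndex with .loc and pd.IndexSlice. A minimal sketch for the example in the question:
# cols 1 and 3 of row 1; pass a list for each level for the more general case
subset = df.loc[pd.IndexSlice[[1], [1, 3]], :]
print(subset)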