Merge value_counts of different pandas dataframes - pandas

I have a list of pandas dataframes for which I compute the value_counts of a column and finally append all the results to another dataframe.
df_AB = pd.read_pickle('df_AB.pkl')
df_AC = pd.read_pickle('df_AC.pkl')
df_AD = pd.read_pickle('df_AD.pkl')
df_AE = pd.read_pickle('df_AE.pkl')
df_AF = pd.read_pickle('df_AF.pkl')
df_AG = pd.read_pickle('df_AG.pkl')
The format of the above dataframes is as below (Example: df_AB):
df_AB:
id is_valid
121 True
122 False
123 True
For every pandas dataframe, I would need to get the value_counts of the is_valid column and store the results in df_result. I tried the code below, but it doesn't seem to work as expected.
df_AB_VC = df_AB['is_valid'].value_counts()
df_AB_VC['group'] = "AB"
df_AC_VC = df_AC['is_valid'].value_counts()
df_AC_VC['group'] = "AC"
Result dataframe (df_result):
Group is_valid_True_Count is_valid_False_Count
AB 2 1
AC
AD
.
.
.
Any leads would be appreciated.

I think you just need to work on the dataframes a bit more systematically:
groups = ['AB', 'AC', 'AD', ...]
out = pd.DataFrame({
    g: pd.read_pickle(f'df_{g}.pkl')['is_valid'].value_counts()
    for g in groups
}).T
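If you also want the Group column and the exact column names from your desired df_result, a small rename on top of out should get you there. This is just a sketch: it assumes the same groups list and pickle naming as above, and it fills a True/False count that is missing for some group with 0:
df_result = (out
             .rename(columns={True: 'is_valid_True_Count',
                              False: 'is_valid_False_Count'})
             .fillna(0)           # a group may be missing one of the two values
             .astype(int)
             .rename_axis('Group')
             .reset_index()
             )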

Do not use separate variables; that makes your code much more complicated. Use a container:
files = ['df_AB.pkl', 'df_AC.pkl', 'df_AD.pkl', 'df_AE.pkl', 'df_AF.pkl']
# using the XX part in "df_XX.pkl", you need to adapt to your real use-case
dataframes = {f[3:5]: pd.read_pickle(f) for f in files}
# compute counts
counts = (pd.DataFrame({k: d['is_valid'].value_counts()
                        for k, d in dataframes.items()})
          .T.add_prefix('is_valid_').add_suffix('_Count')
          )
example output:
is_valid_True_Count is_valid_False_Count
AB 2 1
AC 2 1

Use pathlib to extract the group name, then collect the data into a dictionary before concatenating all entries:
import pandas as pd
import pathlib
data = {}
for pkl in pathlib.Path().glob('df_*.pkl'):
    group = pkl.stem.split('_')[1]
    df = pd.read_pickle(pkl)
    data[group] = df['is_valid'].value_counts() \
                    .add_prefix('is_valid_') \
                    .add_suffix('_Count')
df = pd.concat(data, axis=1).T
>>> df
is_valid_True_Count is_valid_False_Count
AD 2 1
AB 4 2
AC 0 3

Related

Appending GeoDataFrames does not return expected dataframe

I have the following issue when trying to append dataframes containing geometry types. The pandas dataframe I am looking at looks like this:
name x_zone y_zone
0 A1 65.422080 48.147850
1 A1 46.635708 51.165745
2 A1 46.597984 47.657444
3 A1 68.477700 44.073700
4 A3 46.635708 54.108190
5 A3 46.635708 51.844770
6 A3 63.309560 48.826878
7 A3 62.215572 54.108190
As you can see, there are four rows per name, as these represent the corners of polygons. I need this to be in the form of a polygon as defined in geopandas, i.e. I need a GeoDataFrame. To do so, I use the following code for just one of the names (just to check it works):
df = df[df['name']=='A1']
x = df['x_zone'].to_list()
y = df['y_zone'].to_list()
polygon_geom = Polygon(zip(x, y))
crs = {'init': "EPSG:4326"}
polygon = gpd.GeoDataFrame(index=['A1'], crs=crs, geometry=[polygon_geom])
print(polygon)
which returns:
geometry
A1 POLYGON ((65.42208 48.14785, 46.63571 51.16575...
polygon.info()
<class 'geopandas.geodataframe.GeoDataFrame'>
Index: 1 entries, A1 to A1
Data columns (total 1 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 geometry 1 non-null geometry
dtypes: geometry(1)
memory usage: 16.0+ bytes
So far, so good. So, for more names I thought the following would work:
unique_place = list(df['name'].unique())
GE = []
for name in unique_place:
    f = df[df['name'] == name]
    x = f['x_zone'].to_list()
    y = f['y_zone'].to_list()
    polygon_geom = Polygon(zip(x, y))
    crs = {'init': "EPSG:4326"}
    polygon = gpd.GeoDataFrame(index=[name], crs=crs, geometry=[polygon_geom])
    print(polygon.info())
    GE.append(polygon)
But it returns a list, not a dataframe.
[ geometry
A1 POLYGON ((65.42208 48.14785, 46.63571 51.16575...,
geometry
A3 POLYGON ((46.63571 54.10819, 46.63571 51.84477...]
This is strange, because .append() works very well if what is to be appended is a pandas dataframe.
What am I missing? Also, even in the first case, I am left with only the geometry column, but that is not an issue because I can write the file to a shp and read it again to get a second column (name).
Grateful for any solution that'll get me going!
I guess you need example code using groupby on your data. Let me know if that is not the case.
from io import StringIO
import geopandas as gpd
import pandas as pd
from shapely.geometry import Polygon
import numpy as np
dats_str = """index id x_zone y_zone
0 A1 65.422080 48.147850
1 A1 46.635708 51.165745
2 A1 46.597984 47.657444
3 A1 68.477700 44.073700
4 A3 46.635708 54.108190
5 A3 46.635708 51.844770
6 A3 63.309560 48.826878
7 A3 62.215572 54.108190"""
# read the string, convert to dataframe
df1 = pd.read_csv(StringIO(dats_str), sep=r'\s+', index_col='index')
# Use groupby as an iterator to:
#  - collect the items of interest
#  - process some data: mean, create Polygon, maybe others
#  - all are collected/appended as lists
ids = []
counts = []
meanx = []
meany = []
list_x = []
list_y = []
polygon = []
for label, group in df1.groupby('id'):
    # label: 'A1', 'A3'
    # group: the sub-dataframe for that id
    ids.append(label)
    counts.append(len(group))  # number of rows
    meanx.append(group.x_zone.mean())
    meany.append(group.y_zone.mean())
    # process x, y data of this group -> for the polygon
    xs = group.x_zone.values
    ys = group.y_zone.values
    list_x.append(xs)
    list_y.append(ys)
    polygon.append(Polygon(zip(xs, ys)))  # make/collect the polygon
# the items above are used to create a dataframe here
df_from_groupby = pd.DataFrame({'id': ids, 'counts': counts,
                                'meanx': meanx, 'meany': meany,
                                'list_x': list_x, 'list_y': list_y,
                                'polygon': polygon})
If you print the dataframe df_from_groupby, you will get:
id counts meanx meany \
0 A1 4 56.783368 47.761185
1 A3 4 54.699137 52.222007
list_x \
0 [65.42208, 46.635708, 46.597984, 68.4777]
1 [46.635708, 46.635708, 63.30956, 62.215572]
list_y \
0 [48.14785, 51.165745, 47.657444, 44.0737]
1 [54.10819, 51.84477, 48.826878, 54.10819]
polygon
0 POLYGON ((65.42207999999999 48.14785, 46.63570...
1 POLYGON ((46.635708 54.10819, 46.635708 51.844...
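If the end goal is still a GeoDataFrame rather than a plain DataFrame, the collected polygons can be wrapped directly. This is only a sketch, reusing df_from_groupby from above and assuming the EPSG:4326 CRS from the question:
import geopandas as gpd

gdf = gpd.GeoDataFrame(
    df_from_groupby.drop(columns=['list_x', 'list_y', 'polygon']),  # keep id, counts, meanx, meany
    geometry=df_from_groupby['polygon'],
    crs='EPSG:4326',
)
print(gdf)
For the loop in the question, the list GE can likewise be combined into a single GeoDataFrame with pd.concat(GE).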

Aggregate pandas df to get max and min as column

My dataframe is below:
import pandas as pd
inp = [{'c1':10,'c2':100,'c3':100}, {'c1':10,'c2':100,'c3':110}, {'c1':10,'c2':100,'c3':120}, {'c1':11,'c2':100,'c3':100}, {'c1':11,'c2':100,'c3':110}, {'c1':11,'c2':100, 'c3':120}]
df = pd.DataFrame(inp)
This is how I am aggregating:
new_df = df.groupby(['c1', 'c2']).agg({"c3": [min,max]})
But the output is not what I expect. My expected output is below:
inp = [{'c1':10, 'c2':100,'c3_min':100, 'c3_max':120}, {'c1':11, 'c2':100,'c3_min':100, 'c3_max':120}]
df = pd.DataFrame(inp)
What am I doing wrong? How can I reach my expected output?
Try:
# tell pandas to use the vectorized aggregations by passing the
# strings 'min' and 'max' instead of the builtins min and max
new_df = df.groupby(['c1', 'c2'])['c3'].agg(['min', 'max']).reset_index()
Or to match the expected column names exactly:
new_df = (df.groupby(['c1', 'c2'])['c3']
          .agg(['min', 'max'])
          .add_prefix('c3_')
          .reset_index()
          )
An alternative would be to keep your current code and flatten the column index with pandas.MultiIndex.to_flat_index:
# Flatten the column index
new_df.columns = new_df.columns.to_flat_index()
# From tuples to string
new_df.rename(columns='_'.join, inplace=True)
# Reset the index
new_df.reset_index(inplace=True)
Prints:
   c1   c2  c3_min  c3_max
0  10  100     100     120
1  11  100     100     120
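Named aggregation is another way to get flat column names in a single step (a sketch using the question's df; it requires pandas 0.25 or newer):
new_df = df.groupby(['c1', 'c2'], as_index=False).agg(
    c3_min=('c3', 'min'),   # new column = (source column, aggregation)
    c3_max=('c3', 'max'),
)
print(new_df)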

saving dataframe groupby rows to exactly two lines

I have a dataframe and I want to group the rows based on a specific column. The number of rows in each group will be at least 4 and at most 50. I want to save one column from each group into two lines: if the group size is even, say 2n, then n rows go in one line and the remaining n in the second line; if it is odd, n+1 and n (or n and n+1) will do.
For example,
import pandas as pd
from io import StringIO
data = """
id,name
1,A
1,B
1,C
1,D
2,E
2,F
2,ds
2,G
2, dsds
"""
df = pd.read_csv(StringIO(data))
I want to group by id
df.groupby('id',sort=False)
and then get a dataframe like
id name
0 1 A B
1 1 C D
2 2 E F ds
3 2 G dsds
Probably not the most efficient solution, but it works:
import numpy as np
df = df.sort_values('id')
# next 3 lines: for each group, find the separation point
df['range_idx'] = range(0, df.shape[0])
df['mean_rank_group'] = df.groupby(['id'])['range_idx'].transform(np.mean)
df['separate_column'] = df['range_idx'] < df['mean_rank_group']
# groupby itself with the help of the additional column
df.groupby(['id', 'separate_column'], as_index=False)['name'].agg(' '.join).drop(
    columns='separate_column')
This is a somewhat convoluted approach, but it does the job:
def func(s: pd.Series):
    mid = max(s.shape[0] // 2, 1)
    l1 = ' '.join(list(s[:mid]))
    l2 = ' '.join(list(s[mid:]))
    return [l1, l2]

df_new = df.groupby('id').agg(func)
df_new["name1"] = df_new["name"].apply(lambda x: x[0])
df_new["name2"] = df_new["name"].apply(lambda x: x[1])
df = (df_new.drop(labels="name", axis=1)
      .stack()
      .reset_index()
      .drop(labels=["level_1"], axis=1)
      .rename(columns={0: "name"})
      .set_index("id"))
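Another option in the same spirit is to derive the line number from GroupBy.cumcount and the group size, then aggregate. This is just a sketch using the question's original df; ' '.join is assumed because the desired output is space separated:
sizes = df.groupby('id')['name'].transform('size')
# rows in the first half of each group get line 0, the rest get line 1
line = (df.groupby('id').cumcount() >= (sizes + 1) // 2).astype(int)
out = (df.groupby(['id', line])['name']
       .agg(' '.join)
       .droplevel(1)
       .reset_index()
       )
print(out)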

How to quickly normalise data in pandas dataframe?

I have a pandas dataframe as follows.
import pandas as pd
df = pd.DataFrame({
'A':[1,2,3],
'B':[100,300,500],
'C':list('abc')
})
print(df)
A B C
0 1 100 a
1 2 300 b
2 3 500 c
I want to normalise the entire dataframe. Since column C is not a numeric column, what I do is as follows (i.e. remove C first, normalise the data, then add the column back).
df_new = df.drop('C', axis=1)
df_C = df[['C']]
from sklearn import preprocessing
x = df_new.values #returns a numpy array
min_max_scaler = preprocessing.MinMaxScaler()
x_scaled = min_max_scaler.fit_transform(x)
df_new = pd.DataFrame(x_scaled)
df_new['C'] = df_C
However, I am sure that there is an easier way of doing this in pandas (given the column names that I do not need to normalise, do the normalisation directly).
I am happy to provide more details if needed.
Use DataFrame.select_dtypes to pick only the numeric columns, normalize them with min-max scaling, and assign only the normalized columns back:
import numpy as np

df1 = df.select_dtypes(np.number)
df[df1.columns] = (df1 - df1.min()) / (df1.max() - df1.min())
print(df)
A B C
0 0.0 0.0 a
1 0.5 0.5 b
2 1.0 1.0 c
In case you want to apply any other functions on the data frame, you can use df[columns] = df[columns].apply(func).
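For example, swapping the min-max scaling for a z-score with that pattern could look like this (just a sketch applied to the question's df; numpy is imported above, and the z-score is only for illustration):
num_cols = df.select_dtypes(np.number).columns
# standardise each numeric column to zero mean and unit variance
df[num_cols] = df[num_cols].apply(lambda s: (s - s.mean()) / s.std())
print(df)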

Pandas data frame creation using static data

I have a data set like this: {'IT': [1,20,35,44,51,....,1000]}
I want to convert this into a python/pandas data frame. I want to see the output in the format below. How do I achieve this output?
Dept Count
IT 1
IT 20
IT 35
IT 44
IT 51
.. .
.. .
.. .
IT 1000
I can write it the way below, but this is not an efficient way for huge data.
data = [['IT',1],['IT',2],['IT',3]]
df = pd.DataFrame(data,columns=['Dept','Count'])
print(df)
No need for a list comprehension since pandas will automatically fill IT in for every row.
import pandas as pd
d = {'IT':[1,20,35,44,51,1000]}
df = pd.DataFrame({'dept': 'IT', 'count': d['IT']})
Use a list comprehension to build tuples and pass them to the DataFrame constructor:
d = {'IT':[1,20,35,44,51], 'NEW':[1000]}
data = [(k, x) for k, v in d.items() for x in v]
df = pd.DataFrame(data,columns=['Dept','Count'])
print(df)
Dept Count
0 IT 1
1 IT 20
2 IT 35
3 IT 44
4 IT 51
5 NEW 1000
You can use melt:
import pandas as pd
d = {'IT': [10]*100000}
df = pd.DataFrame(d)
df = pd.melt(df, var_name='Dept', value_name='Count')
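Applied to the dictionary from the question, that would look roughly like this (a sketch; the shortened list stands in for the real data):
d = {'IT': [1, 20, 35, 44, 51, 1000]}
df = pd.melt(pd.DataFrame(d), var_name='Dept', value_name='Count')
print(df)
  Dept  Count
0   IT      1
1   IT     20
2   IT     35
3   IT     44
4   IT     51
5   IT   1000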