Converting a flat table of records to an aggregate dataframe in Pandas [duplicate] - pandas

This question already has answers here:
How can I pivot a dataframe?
(5 answers)
Closed 4 years ago.
I have a flat table of records about objects. Objects have a type (ObjType) and are hosted in containers (ContainerId). The records also have some other attributes about the objects, but those are not of interest at present. So, basically, the data looks like this:
Id ObjName XT ObjType ContainerId
2 name1 x1 A 2
3 name2 x5 B 2
22 name5 x3 D 7
25 name6 x2 E 7
35 name7 x3 G 7
..
..
92 name23 x2 A 17
95 name24 x8 B 17
99 name25 x5 A 21
What I am trying to do is 're-pivot' this data to further analyze which containers are 'similar' by looking at the types of objects they host in aggregate.
So, I am looking to convert the above data to the form below:
ObjType A B C D E F G
ContainerId
2 2.0 1.0 1.0 0.0 0.0 0.0 0.0
7 0.0 0.0 0.0 1.0 2.0 1.0 1.0
9 1.0 1.0 0.0 1.0 0.0 0.0 0.0
11 0.0 0.0 0.0 2.0 3.0 1.0 1.0
14 1.0 1.0 0.0 1.0 0.0 0.0 0.0
17 1.0 1.0 0.0 0.0 0.0 0.0 0.0
21 1.0 0.0 0.0 0.0 0.0 0.0 0.0
This is how I have managed to do it currently (after a lot of stumbling and using various tips from questions such as this one). I am getting the right results but, being new to Pandas and Python, I feel that I must be taking a long route. (I have added a few comments to explain the pain points.)
import pandas as pd
rdf = pd.read_csv('.\\testdata.csv')
#The data in the below group-by is all that is needed but in a re-pivoted format...
rdfg = rdf.groupby('ContainerId').ObjType.value_counts()
#Remove 'ContainerId' and 'ObjType' from the index
#Had to do reset_index in two steps because otherwise there's a conflict with 'ObjType'.
#That is, just rdfg.reset_index() does not work!
rdx = rdfg.reset_index(level=['ContainerId'])
#Renaming the 'ObjType' column helps get around the conflict so the 2nd reset_index works.
rdx.rename(columns={'ObjType':'Count'}, inplace=True)
cdx = rdx.reset_index()
#After this a simple pivot seems to do it
cdf = cdx.pivot(index='ContainerId', columns='ObjType',values='Count')
#Replacing the NaNs because not all containers have all object types
cdf.fillna(0, inplace=True)
Ask: Can someone please share other possible approaches that could perform this transformation?

This is a use case for pd.crosstab. Docs.
e.g.
In [539]: pd.crosstab(df.ContainerId, df.ObjType)
Out[539]:
ObjType A B D E G
ContainerId
2 1 1 0 0 0
7 0 0 1 1 1
17 1 1 0 0 0
21 1 0 0 0 0
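If you prefer to stay with groupby, a size/unstack combination gives the same table; a small equivalent sketch using the same df:
counts = (df.groupby(['ContainerId', 'ObjType'])
            .size()                          # rows per (container, type) pair
            .unstack('ObjType', fill_value=0))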

Related

Dynamic sum of one column based on NA values of another column in Pandas

I've got an ordered dataframe, df. It's grouped by 'ID' and ordered by 'order'.
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {'ID': ['A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A',
            'B', 'B', 'B', 'B', 'B', 'B', 'B', 'B'],
     'order': [1, 3, 4, 6, 7, 9, 11, 12, 13, 14, 15, 16, 19, 25, 8, 10, 15, 17, 20, 25, 29, 31],
     'col1': [1, 2, np.nan, 1, 2, 3, 4, 5, np.nan, np.nan, 6, 7, 8, 9,
              np.nan, np.nan, np.nan, 10, 11, 12, np.nan, 13],
     'col2': [1, 5, 6, np.nan, 1, 2, 3, np.nan, 2, 3, np.nan, np.nan, 3, 1,
              5, np.nan, np.nan, np.nan, 2, 3, np.nan, np.nan],
     }
)
In each ID group, I need to sum col1 over the rows where col2 is NA, together with the col1 value of the first following row where col2 is present again; the result should sit on that row:
I prefer a vectorised solution to make it fast, but it could be difficult.
I need to use this in a groupby (as col1_dynamic_sum should be grouped by ID).
What I have done so far is define a function that counts the number of previous consecutive NAs for each row:
def count_prev_consec_na(input_col):
    """
    This function takes a dataframe Series (column) and outputs the number of
    consecutive missing values in previous rows.
    """
    try:
        a1 = input_col.isna() + 0         # missing
        a2 = ~input_col.isna() + 0        # not missing
        b1 = a1.shift().fillna(0)         # prev missing
        d = a1.cumsum()
        e = b1 * a2
        f = d * e
        g = f.replace(0, np.nan)
        h = g.ffill()
        h = h.fillna(0)
        i = h.shift()
        result = h - i
        result = result.fillna(0)
        return result
    except Exception as e:
        print(e)
        return None
I think one solution is to use this to get the dynamic number of rows that need to be rolled back for the sum:
df['roll_back_count'] = df.groupby(['ID'], as_index = False).col2.transform(count_prev_consec_na)
ID order col1 col2 roll_back_count
A 1 1.0 1.0 0.0
A 3 2.0 5.0 0.0
A 4 NaN 6.0 0.0
A 6 1.0 NaN 0.0
A 7 2.0 1.0 1.0 ## I want to sum col1 of order 6 and 7 and remove order 6 row
A 9 3.0 2.0 0.0
A 11 4.0 3.0 0.0
A 12 5.0 NaN 0.0
A 13 NaN 2.0 1.0 ## I want to sum col1 of order 12 and 13 and remove order 12 row
A 14 NaN 3.0 0.0
A 15 6.0 NaN 0.0
A 16 7.0 NaN 0.0
A 19 8.0 3.0 2.0 ## I want to sum col1 of order 15,16,19 and remove order 15 and 16 rows
A 25 9.0 1.0 0.0
B 8 NaN 5.0 0.0
B 10 NaN NaN 0.0
B 15 NaN NaN 0.0
B 17 10.0 NaN 0.0 ## I want to sum col1 of order 10,15,17,20 and remove order 10,15,17 rows
B 20 11.0 2.0 3.0
B 25 12.0 3.0 0.0
B 29 NaN NaN 0.0
B 31 13.0 NaN 0.0
this is my desired output:
desired_output:
ID order col1_dynamic_sum col2
A 1 1.0 1
A 3 2.0 5
A 4 NaN 6
A 7 3.0 1
A 9 3.0 2
A 11 4.0 3
A 13 5.0 2
A 14 NaN 3
A 19 21.0 3
A 25 9.0 1
B 8 NaN 5
B 20 21.0 2
B 25 12.0 3
note: the sums should ignore NAs
Again, I prefer a vectorised solution, but it might not be possible due to the rolling effect.
Gah, I think I found a solution that doesn't involve rolling at all!
I created a new grouping ID based on the NA pattern of col2: each row gets the index of the next row that does have a col2 value (backfilled). I then use this grouping ID to aggregate!
def create_na_group(rollback_col):
    a = ~rollback_col.isna() + 0
    b = a.replace(0, np.nan)
    c = rollback_col.index
    d = c * b
    d = d.bfill()
    return d
df['na_group'] = df.groupby(['ID'], as_index = False).col2.transform(create_na_group)
df = df.loc[~df.na_group.isna()]
desired_output = df.groupby(['ID', 'na_group'], as_index=False).agg(
    order=('order', 'last'),
    col1_dyn_sum=('col1', 'sum'),
    col2=('col2', 'sum'),
)
I just have to find a way to make sure NaNs don't become 0, like in rows 2, 7 and 10.
ID na_group order col1_dyn_sum col2
0 A 0.0 1 1.0 1.0
1 A 1.0 3 2.0 5.0
2 A 2.0 4 0.0 6.0
3 A 4.0 7 3.0 1.0
4 A 5.0 9 3.0 2.0
5 A 6.0 11 4.0 3.0
6 A 8.0 13 5.0 2.0
7 A 9.0 14 0.0 3.0
8 A 12.0 19 21.0 3.0
9 A 13.0 25 9.0 1.0
10 B 14.0 8 0.0 5.0
11 B 18.0 20 21.0 2.0
12 B 19.0 25 12.0 3.0
I'll just create two separate sum columns with lambda x: x.sum(skipna=False) and lambda x: x.sum(skipna=True), and then if the skipna=True sum column is 0 and the skipna=False sum column is NA, I'll leave the final sum as NA; otherwise, I use the skipna=True sum column as the final desired output.
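A minimal, untested sketch of that idea, reusing df, na_group and the column names from the code above:
import numpy as np
agg_df = df.groupby(['ID', 'na_group'], as_index=False).agg(
    order=('order', 'last'),
    col1_skipna=('col1', lambda s: s.sum(skipna=True)),
    col1_noskip=('col1', lambda s: s.sum(skipna=False)),
    col2=('col2', 'sum'),
)
# keep NaN when every col1 value in the group was NaN, otherwise use the skipna sum
agg_df['col1_dyn_sum'] = np.where(
    agg_df['col1_noskip'].isna() & (agg_df['col1_skipna'] == 0),
    np.nan,
    agg_df['col1_skipna'],
)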

groupby shows unobserved values of non-categorical columns

I created this simple example to illustrate my issue:
x = pd.DataFrame({"int_var1": range(3), "int_var2": range(3, 6), "cat_var": pd.Categorical(["a", "b", "a"]), "value": [0.1, 0.2, 0.3]})
it yields this DataFrame:
   int_var1  int_var2 cat_var  value
0         0         3       a    0.1
1         1         4       b    0.2
2         2         5       a    0.3
where the first two columns are integers, the third column is categorical with two levels, and the fourth column is floats. The issue is that when I try to use groupby followed by agg, it seems I only have two options: either I can show no unobserved values, like so:
x.groupby(['int_var1', 'int_var2', 'cat_var'], observed = True).agg({"value": "sum"}).fillna(0)
int_var1 int_var2 cat_var value
0 3 a 0.1
1 4 b 0.2
2 5 a 0.3
or I can show unobserved values for all grouping variables like so:
x.groupby(['int_var1', 'int_var2', 'cat_var'], observed = False).agg({"value": "sum"}).fillna(0)
int_var1 int_var2 cat_var value
0 3 a 0.1
b 0.0
4 a 0.0
b 0.0
5 a 0.0
b 0.0
1 3 a 0.0
b 0.0
4 a 0.0
b 0.2
5 a 0.0
b 0.0
2 3 a 0.0
b 0.0
4 a 0.0
b 0.0
5 a 0.3
b 0.0
Is there a way to show unobserved values for the categorical variables only and not every possible permutation of all grouping variables?
You can unstack the level of interest, cat_var in this case:
(x.groupby(['int_var1', 'int_var2', 'cat_var'],observed=True)
.agg({'value':'sum'})
.unstack('cat_var',fill_value=0)
)
Output:
value
cat_var a b
int_var1 int_var2
0 3 0.1 0.0
1 4 0.0 0.2
2 5 0.3 0.0
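If you then want the long format back (one row per group and category, with zeros kept), you can stack again; a small sketch:
out = (x.groupby(['int_var1', 'int_var2', 'cat_var'], observed=True)
        .agg({'value': 'sum'})
        .unstack('cat_var', fill_value=0)   # both categories become columns
        .stack('cat_var')                   # back to long format, zeros preserved
        .reset_index())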

mode returns Exception: Must produce aggregated value

For this dataframe:
values ii
0 3.0 4
1 0.0 1
2 3.0 8
3 2.0 5
4 2.0 1
5 3.0 5
6 2.0 4
7 1.0 8
8 0.0 5
9 1.0 1
This line returns "Must produce aggregated value":
bii2=df.groupby(['ii'])['values'].agg(pd.Series.mode)
While this line works
bii3=df.groupby('ii')['values'].agg(lambda x: pd.Series.mode(x)[0])
Could you explain why that is?
The problem is that mode sometimes returns 2 or more values; check the solution with GroupBy.apply:
bii2=df.groupby(['ii'])['values'].apply(pd.Series.mode)
print (bii2)
ii
1 0 0.0
1 1.0
2 2.0
4 0 2.0
1 3.0
5 0 0.0
1 2.0
2 3.0
8 0 1.0
1 3.0
Name: values, dtype: float64
And pandas agg needs a scalar in the output, so it returns an error. So if you select the first value, it works nicely:
bii3=df.groupby('ii')['values'].agg(lambda x: pd.Series.mode(x).iat[0])
print (bii3)
ii
1 0.0
4 2.0
5 0.0
8 1.0
Name: values, dtype: float64
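If you actually want to keep all tied modes per group, one option (a sketch, not the only way) is apply, which is not restricted to scalar results:
all_modes = df.groupby('ii')['values'].apply(lambda s: s.mode().tolist())
# one row per group, all modes collected in a list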

How to select and calculate with value from specific variable in dataframe with pandas

I am running the code below and get this:
import pandas as pd
pf=pd.read_csv("https://www.dropbox.com/s/08kuxi50d0xqnfc/demo.csv?dl=1")
x=pf[pf['fuv1'] == 0].count()*100/1892
x
id 0.528541
date 0.528541
count 0.528541
idade 0.528541
site 0.528541
baseline 0.528541
fuv1 0.528541
fuv2 0.475687
fuv3 0.528541
fuv4 0.475687
dtype: float64
What I want is just to get the result 0.528541 and ignore all the other results above.
What should I do?
Thanks.
If you want to count the number of 0 values in column fuv1, use sum to count the Trues, which are treated as 1s:
print ((pf['fuv1'] == 0).sum())
10
x = (pf['fuv1'] == 0).sum()*100/1892
print (x)
0.528541226216
Explanation of why the outputs differ - count excludes NaNs:
pf=pd.read_csv("https://www.dropbox.com/s/08kuxi50d0xqnfc/demo.csv?dl=1")
x=pf[pf['fuv1'] == 0]
print (x)
id date count idade site baseline fuv1 fuv2 fuv3 fuv4
0 0 4/1/2016 10 13 A 1 0.0 1.0 0.0 1.0
2 2 4/3/2016 9 5 C 1 0.0 NaN 0.0 1.0
3 3 4/4/2016 108 96 D 1 0.0 1.0 0.0 NaN
11 11 4/12/2016 6 13 C 1 0.0 1.0 1.0 0.0
13 13 4/14/2016 12 4 C 1 0.0 1.0 1.0 0.0
40 40 5/11/2016 14 7 C 1 0.0 1.0 1.0 1.0
41 41 5/12/2016 0 26 C 1 0.0 1.0 1.0 1.0
42 42 5/13/2016 10 15 C 1 0.0 1.0 1.0 1.0
60 60 5/31/2016 13 3 D 1 0.0 1.0 1.0 1.0
74 74 6/14/2016 15 7 B 1 0.0 1.0 1.0 1.0
print (x.count())
id 10
date 10
count 10
idade 10
site 10
baseline 10
fuv1 10
fuv2 9
fuv3 10
fuv4 9
dtype: int64
In [282]: pf.loc[pf['fuv1'] == 0, 'id'].count()*100/1892
Out[282]: 0.5285412262156448
import pandas as pd
pf=pd.read_csv("https://www.dropbox.com/s/08kuxi50d0xqnfc/demo.csv?dl=1")
x = (pf['fuv1'] == 0).sum()*100/1892
y=pf["idade"].mean()
l = "Performance"
k = "LTFU"
def test(l1, k1):
    return pd.DataFrame({'a': [l1, k1], 'b': [x, y]})
df1 = test(l,k)
df1.columns = [''] * len(df1.columns)
df1.index = [''] * len(df1.index)
print(round(df1, 2))
Performance 0.53
LTFU 14.13

Pandas programming model for the rolling window indexing

I need advice on the programming pattern and use of a DataFrame for our data. We have thousands of small ASCII files that are the results of particle tracking experiments (see www.openptv.net for details). Each file is a list of the particles identified and tracked in that time instance. The name of the file is the number of the frame. For example:
ptv_is.10000 (i.e. frame no. 10000)
prev next x y z
-1 5 0.0 0.0 0.0
0 0 1.0 1.0 1.0
1 1 2.0 2.0 2.0
2 2 3.0 3.0 3.0
3 -2 4.0 4.0 4.0
ptv_is.10001 (i.e. next time frame, 10001)
1 2 1.1 1.0 1.0
2 8 2.0 2.0 2.0
3 14 3.0 3.0 3.0
4 -2 4.0 4.0 4.0
-1 3 1.5 1.12 1.32
0 -2 0.0 0.0 0.0
The columns of the ASCII files are: prev is the row number of the particle in the previous frame, next is the row number of the particle in the next frame, and x, y, z are the coordinates of the particle. If 'prev' is -1, the particle appeared in the current frame and doesn't have a link back in time. If 'next' is -2, the particle doesn't have a link forward in time and the trajectory ends in this frame.
So we are reading these files into a single DataFrame with the same column headers, plus an added column for time, i.e. the frame number (a reading sketch follows the table):
prev next x y z time
-1 5 0.0 0.0 0.0 10000
0 0 1.0 1.0 1.0 10000
1 1 2.0 2.0 2.0 10000
2 2 3.0 3.0 3.0 10000
3 -2 4.0 4.0 4.0 10000
1 2 1.1 1.0 1.0 10001
2 8 2.0 2.0 2.0 10001
3 14 3.0 3.0 3.0 10001
4 -2 4.0 4.0 4.0 10001
-1 3 1.5 1.12 1.32 10001
0 -2 0.0 0.0 0.0 10001
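For completeness, a minimal sketch of that reading step; the ptv_is.* file pattern and whitespace-separated columns are assumptions based on the samples above:
import glob
import pandas as pd

frames = []
for path in sorted(glob.glob('ptv_is.*')):
    frame_no = int(path.rsplit('.', 1)[-1])               # frame number taken from the file name
    f = pd.read_csv(path, sep=r'\s+',
                    names=['prev', 'next', 'x', 'y', 'z'])  # assumes no header line in the files
    f['time'] = frame_no
    frames.append(f)
df = pd.concat(frames, ignore_index=True)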
Now comes the step where I find it difficult to choose the best way of using the DataFrame. If we could add an additional column, called trajectory_id, we'd later be able to reindex this DataFrame either by time (creating sub-groups of the particles in a single time instance and learning their spatial distributions) or by trajectory_id, and then create trajectories (linked particles) and learn about their time evolution in space, e.g. x(t), y(t), z(t) for the same trajectory_id.
If the input is:
prev next x y z time
-1 5 0.0 0.0 0.0 10000
0 0 1.0 1.0 1.0 10000
1 1 2.0 2.0 2.0 10000
2 2 3.0 3.0 3.0 10000
3 -2 4.0 4.0 4.0 10000
1 2 1.1 1.0 1.0 10001
2 8 2.0 2.0 2.0 10001
3 14 3.0 3.0 3.0 10001
4 -2 4.0 4.0 4.0 10001
-1 3 1.5 1.12 1.32 10001
0 -2 0.0 0.0 0.0 10001
Then the result I need is:
prev next x y z time trajectory_id
-1 5 0.0 0.0 0.0 10000 1
0 0 1.0 1.0 1.0 10000 2
1 1 2.0 2.0 2.0 10000 3
2 2 3.0 3.0 3.0 10000 4
3 -2 4.0 4.0 4.0 10000 -999
1 2 1.1 1.0 1.0 10001 2
2 8 2.0 2.0 2.0 10001 3
3 14 3.0 3.0 3.0 10001 4
-1 -2 4.0 4.0 4.0 10001 -999
-1 3 1.5 1.1 1.3 10001 5
0 -2 0.0 0.0 0.0 10001 1
which means:
prev next x y z time trajectory_id
-1 5 0.0 0.0 0.0 10000 1 < - appeared first time, new id
0 0 1.0 1.0 1.0 10000 2 < - the same
1 1 2.0 2.0 2.0 10000 3 <- the same
2 2 3.0 3.0 3.0 10000 4 <- the same
3 -2 4.0 4.0 4.0 10000 -999 <- sort of NaN, there is no link in the next frame
1 2 1.1 1.0 1.0 10001 2 <- from row #1 in the time 10000, has an id = 2
2 8 2.0 2.0 2.0 10001 3 <- row #2 at previous time, has an id = 3
3 14 3.0 3.0 3.0 10001 4 < from row # 3, next on the row #14, id = 4
-1 -2 4.0 4.0 4.0 10001 -999 <- but linked, marked as NaN or -999
-1 3 1.5 1.1 1.3 10001 5 <- new particle, new id = 5 (new trajectory_id)
0 -2 0.0 0.0 0.0 10001 1 <- from row #0 id = 1
Hope this explains better what I'm looking for. The only problem is that I do not know how to have a rolling function through the rows of a DataFrame table, creating a new index column, trajectory_id.
For example, the simple application with lists is shown here:
http://nbviewer.ipython.org/7020209
Thanks for every hint on pandas use,
Alex
Neat! This problem is close to my heart; I also use pandas for particle tracking. This is not exactly the same problem I work on, but here's an untested sketch that offers some helpful pandas idioms.
import numpy as np
import pandas as pd

results = []
first_loop = True
next_id = None
for frame_no, frame in pd.concat(list_of_dataframes).groupby('time'):
    frame = frame.copy()                       # work on a copy of the groupby chunk
    if first_loop:
        frame['traj_id'] = np.arange(len(frame))
        results.append(frame)
        next_id = len(frame)
        first_loop = False
        continue
    prev_frame = results[-1]
    has_matches = frame['prev'] >= 0           # boolean indexer; prev == -1 means no link back
    frame.loc[has_matches, 'traj_id'] = (
        prev_frame['traj_id'].values[frame.loc[has_matches, 'prev'].values])
    count_unmatched = (~has_matches).sum()
    frame.loc[~has_matches, 'traj_id'] = np.arange(next_id, next_id + count_unmatched)
    next_id += count_unmatched
    results.append(frame)
pd.concat(results)
If I understand well, you want to track the position of particles in space across time. You are dealing with five-dimensional data, so maybe a DataFrame is not the best structure for your problem and you may want to think about a panel structure, or a reduction of the data.
Taking one particle, you have two possibilities: either treat the coordinates as three different values, so you need three fields, or treat them as a whole, e.g. a tuple or a point object.
In the first case you have time plus three values, so four axes: you need a DataFrame. In the second case you have two axes, so you can use a Series.
For multiple particles just use a particle_id and put all the DataFrames in a Panel or the Series in a DataFrame.
Once you know what data structure to use, it's time to put the data in.
Read the files sequentially and keep a collection of 'live' particles, e.g.:
{particle_id1: { time1: (x1,y1,z1), time2: (x2,y2,z2), ...}, ...}
When a new particle is detected (-1 in prev), assign a new particle_id and put it in the collection. When a particle 'dies' (-2 in next), pop it out of the collection, put the data in a Series and then add this Series to a particle DataFrame (or DataFrame / Panel).
You could also keep an index of particle ids keyed by the next field to help recognize ids:
{ next_position_of_last_file: particle_id, ... }
or
{ position_in_last_file: particle_id, ...}
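A rough, untested sketch of that bookkeeping (names are illustrative; df is the combined DataFrame from the question, assumed to preserve the row order of each file):
live = {}       # row position in the previous frame -> particle_id
tracks = {}     # particle_id -> {time: (x, y, z)}
next_id = 0

for frame_no, frame in df.groupby('time'):
    new_live = {}
    for pos, row in enumerate(frame.itertuples(index=False)):
        if row.prev == -1 or row.prev not in live:
            pid = next_id                  # new particle, new trajectory id
            next_id += 1
        else:
            pid = live[row.prev]           # continue the trajectory from the previous frame
        tracks.setdefault(pid, {})[frame_no] = (row.x, row.y, row.z)
        if row.next != -2:
            new_live[pos] = pid            # expected to appear again in the next frame
    live = new_live

# each tracks[pid] can now be turned into a Series (time -> coordinates) and collected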