concatenate every n rows into one row in pandas

I have:
pd.DataFrame({'col':['one','fish','two','fish','left','foot','right','foot']})
col
0 one
1 fish
2 two
3 fish
4 left
5 foot
6 right
7 foot
I want to concatenate every n rows (here every 4) and form a new dataframe:
pd.DataFrame({'col':['one fish two fish','left foot right foot']})
col
0 one fish two fish
1 left foot right foot
I am using Python and pandas

If there is a default RangeIndex, use integer division of the index and aggregate with ' '.join:
print (df.groupby(df.index // 4).agg(' '.join))
#for a non-RangeIndex, create a helper array instead (requires numpy imported as np)
#print (df.groupby(np.arange(len(df)) // 4).agg(' '.join))
col
0 one fish two fish
1 left foot right foot
If you want to specify the column col:
print (df.groupby(df.index // 4)['col'].agg(' '.join).to_frame())

Try groupby:
df['col'].groupby(np.repeat(np.arange(len(df)), 4)[:len(df)]).agg(' '.join)
Output:
0 one fish two fish
1 left foot right foot
Name: col, dtype: object
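For reference, a minimal self-contained sketch of the integer-division approach (it assumes numpy and pandas are imported; the group size n = 4 is just this example's choice):
import numpy as np
import pandas as pd

df = pd.DataFrame({'col': ['one', 'fish', 'two', 'fish',
                           'left', 'foot', 'right', 'foot']})
n = 4
# integer-divide the positional index to build group labels 0,0,0,0,1,1,1,1,...
out = df.groupby(np.arange(len(df)) // n)['col'].agg(' '.join).to_frame()
print (out)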

Related

Conditional merge and transformation of data in pandas

I have two data frames, and I want to create new columns in frame 1 using properties from frame 2
frame 1
Name
alice
bob
carol
frame 2
Name Type Value
alice lower 1
alice upper 2
bob equal 42
carol lower 0
desired result
frame 1
Name Lower Upper
alice 1 2
bob 42 42
carol 0 NA
Hence, the common column of both frames is Name. You can use Name to look up the bounds of a variable, which are specified in the second frame. Frame 1 lists each name exactly once. Frame 2 might have one or two entries per name, each specifying either a lower or an upper bound (or both at once if the type is equal). We do not need both bounds for every variable; one of the bounds can stay empty. I would like to have a frame that lists the range of each variable. I see how I could do that with for-loops over the columns, but that does not seem to be in the pandas spirit. Do you have any suggestions for a compact solution? :-)
Thanks in advance
You're not looking for a merge, but rather a pivot.
(df2[df2['Name'].isin(df1['Name'])]
.pivot(index='Name', columns='Type', values='Value')
.reset_index()
)
But this doesn't handle the special 'equal' case.
For this, you can use a little trick. Replace 'equal' by a list with the other two values and explode to create the two rows.
(df2[df2['Name'].isin(df1['Name'])]
.assign(Type=lambda d: d['Type'].map(lambda x: {'equal': ['lower', 'upper']}.get(x,x)))
.explode('Type')
.pivot(index='Name', columns='Type', values='Value')
.reset_index()
.convert_dtypes()
)
Output:
Name lower upper
0 alice 1 2
1 bob 42 42
2 carol 0 <NA>
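If frame 1 could also contain names that never appear in frame 2 and those names should still show up in the result, one option (this goes beyond the original answer, which only filters frame 2) is to left-merge the pivoted bounds back onto frame 1:
bounds = (df2.assign(Type=df2['Type'].map(lambda x: {'equal': ['lower', 'upper']}.get(x, x)))
             .explode('Type')
             .pivot(index='Name', columns='Type', values='Value')
             .reset_index())
# how='left' keeps every name from frame 1; names without bounds get <NA>
result = df1.merge(bounds, on='Name', how='left').convert_dtypes()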

counting unique values in column using sub-id

I have a df containing sub-trajectories (segments) of users, with mode of travel indicated by 0,1,2... which looks like this:
df = pd.read_csv('sample.csv')
df
id lat lon mode
0 5138001 41.144540 -8.562926 0
1 5138001 41.144538 -8.562917 0
2 5138001 41.143689 -8.563012 0
3 5138003 43.131562 -8.601273 1
4 5138003 43.132107 -8.598124 1
5 5145001 37.092095 -8.205070 0
6 5145001 37.092180 -8.204872 0
7 5145015 39.289341 -8.023454 2
8 5145015 39.197432 -8.532761 2
9 5145015 39.198361 -8.375641 2
In the above sample, id identifies the segments, but a full trajectory may be covered by different modes (i.e. it contains multiple segments).
So the first 4 digits of id identify the trajectory, and the last 3 digits identify the segment within that trajectory.
I know that I can count the number of unique segments in the df using:
df.groupby('id')['mode'].nunique()
How do I then count the number of unique trajectories 5138, 5145, ...?
Use indexing with str to get the first 4 characters; if necessary, first convert the values to strings by Series.astype:
df = df.groupby(df['id'].astype(str).str[:4])['mode'].nunique().reset_index(name='count')
print (df)
id count
0 5138 2
1 5145 2
If you need to process the values after the first 4 digits of id:
s = df['id'].astype(str)
df = s.str[4:].groupby(s.str[:4]).nunique().reset_index(name='count')
print (df)
id count
0 5138 2
1 5145 2
Another idea is to use a lambda function:
df.groupby(df['id'].apply(lambda x: str(x)[:4]))['mode'].nunique()
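Since the ids in the sample are 7-digit integers, an arithmetic alternative (this assumes every id really has exactly 7 digits) is integer division, which avoids the string conversion:
#5138001 // 1000 -> 5138, so this groups by trajectory
print (df.groupby(df['id'] // 1000)['mode'].nunique())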

Calculating the difference between values based on their date

I have a dataframe that looks like this, where the "Date" is set as the index
A B C D E
Date
1999-01-01 1 2 3 4 5
1999-01-02 1 2 3 4 5
1999-01-03 1 2 3 4 5
1999-01-04 1 2 3 4 5
I'm trying to compare the percent difference between two pairs of dates. I think I can do the first bit:
start_1 = "1999-01-02"
end_1 = "1999-01-03"
start_2 = "1999-01-03"
end_2 = "1999-01-04"
Obs_1 = df.loc[end_1] / df.loc[start_1] -1
Obs_2 = df.loc[end_2] / df.loc[start_2] -1
The output I get from - eg Obs_1 looks like this:
A 0.011197
B 0.007933
C 0.012850
D 0.016678
E 0.007330
dtype: float64
I'm looking to build some correlations between Obs_1 and Obs_2. I think I need to create a new dataframe with the labels A-E as one column (or as the index), and then the data series from Obs_1 and Obs_2 as adjacent columns.
But I'm struggling! I can't 'see' what Obs_1 and Obs_2 'are' - have I created a list? A series? How can I tell? What would be the best way of combining the two into a single dataframe...say df_1.
I'm sure the answer is staring me in the face but I'm going mental trying to figure it out...and because I'm not quite sure what Obs_1 and Obs_2 'are', it's hard to search the SO archive to help me.
Thanks in advance
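For what it's worth, df.loc[end_1] / df.loc[start_1] - 1 returns a pandas Series whose index is the column labels A-E (type(Obs_1) will confirm this). A minimal sketch of one way to put the two observations side by side and correlate them (the column names Obs_1 and Obs_2 are just illustrative):
import pandas as pd

# each Series becomes one column; the A-E labels become the shared index
df_1 = pd.concat([Obs_1, Obs_2], axis=1, keys=['Obs_1', 'Obs_2'])
print (df_1['Obs_1'].corr(df_1['Obs_2']))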

collapse pandas dataframe rows based on index column

I have a dataframe that contains information that is linked by an ID column. The rows are sequential with the odd rows containing a "start-point" and the even rows containing an "end" point. My goal is to collapse the data from these into a single row with columns for "start" and "end" following each other. The rows do have a "packet ID" that would link them if the sequential nature of the dataframe is not consistent.
example:
df:
0 1 2 3 4 5
0 hs6 106956570 106956648 ID_A1 60 -
1 hs1 153649721 153649769 ID_A1 60 -
2 hs1 865130744 865130819 ID_A2 0 -
3 hs7 21882206 21882237 ID_A2 0 -
4 hs1 74230744 74230819 ID_A3 0 +
5 hs8 92041314 92041508 ID_A3 0 +
The resulting dataframe that I am trying to achieve is:
new_df
0 1 2 3 4 5
0 hs6 106956570 106956648 hs1 153649721 153649769
1 hs1 865130744 865130819 hs7 21882206 21882237
2 hs1 74230744 74230819 hs8 92041314 92041508
with each row containing the information on both the start and the end-point.
I have tried to pass the IDs into an array and use a for loop to pull the information out of the original dataframe into a new dataframe, but this has not worked. I was looking at the melt documentation, which would suggest that pd.melt(df, id_vars=[3], value_vars=[0,1,2]) may work, but I cannot see how to get the corresponding row into positions new_df[3,4,5].
I think that it may be something really simple that I am missing but any suggestions would be appreciated.
You can try this:
df_out = df.set_index([df.index%2, df.index//2])[df.columns[:3]]\
.unstack(0).sort_index(level=1, axis=1)
df_out.columns = np.arange(len(df_out.columns))
df_out
Output:
0 1 2 3 4 5
0 hs6 106956570 106956648 hs1 153649721 153649769
1 hs1 865130744 865130819 hs7 21882206 21882237
2 hs1 74230744 74230819 hs8 92041314 92041508
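The question also notes that the rows might not always alternate strictly; here is a sketch of an alternative keyed on the ID column (column 3) instead of on row position, assuming each ID appears exactly twice:
import numpy as np
import pandas as pd

# number the rows within each ID (0 = start row, 1 = end row), then unstack
key = df.groupby(3).cumcount()
df_out = (df.set_index([key, df[3]])[[0, 1, 2]]
            .unstack(0)
            .sort_index(level=1, axis=1)
            .reset_index(drop=True))
df_out.columns = np.arange(len(df_out.columns))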

Create new column on pandas DataFrame in which the entries are randomly selected entries from another column

I have a DataFrame with the following structure.
df = pd.DataFrame({'tenant_id': [1,1,1,2,2,2,3,3,7,7], 'user_id': ['ab1', 'avc1', 'bc2', 'iuyt', 'fvg', 'fbh', 'bcv', 'bcb', 'yth', 'ytn'],
'text':['apple', 'ball', 'card', 'toy', 'sleep', 'happy', 'sad', 'be', 'u', 'pop']})
This gives the following output:
df = df[['tenant_id', 'user_id', 'text']]
tenant_id user_id text
1 ab1 apple
1 avc1 ball
1 bc2 card
2 iuyt toy
2 fvg sleep
2 fbh happy
3 bcv sad
3 bcb be
7 yth u
7 ytn pop
I would like to groupby on tenant_id and create a new column which is a random selection of strings from the user_id column.
Thus, I would like my output to look like the following:
tenant_id user_id text new_column
1 ab1 apple [ab1, bc2]
1 avc1 ball [ab1]
1 bc2 card [avc1]
2 iuyt toy [fvg, fbh]
2 fvg sleep [fbh]
2 fbh happy [fvg]
3 bcv sad [bcb]
3 bcb be [bcv]
7 yth u [pop]
7 ytn pop [u]
Here, random ids from the user_id column have been selected; these ids can be repeated, as "fvg" is for tenant_id=2. I would like to have a threshold of not more than ten ids. This data is just a sample and has only 10 ids to start with, so in general the number would be much less than the total number of user_ids, in this case say one less than the total user_ids that belong to a tenant.
I first tried figuring out how to select a random subset of varying length with
df.sample
new_column = df.user_id.sample(n=np.random.randint(1, 10))
I am kinda lost after this; assigning it to my df results in NaNs, probably because the samples are of variable length. Please help.
Thanks.
per my comment:
Your 'new column' is not a new column, it's a new cell for a single row.
If you want to assign the result to a new column, you need to create a new column, and apply the cell computation to it.
df['new column'] = df['user_id'].apply(lambda x: df.user_id.sample(n=np.random.randint(1, 10)).tolist())
It doesn't really matter which column you use for the apply, since the variable x is not used in the computation.
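The sample above is drawn from the whole user_id column; if the ids should come only from the same tenant (as the example output suggests), here is a sketch of one way to do it per group (the helper name pick_ids is hypothetical, and the upper bound of one less than the tenant's user count is taken from the question):
import numpy as np

def pick_ids(group):
    # per row, sample between 1 and len(group) - 1 user_ids from the same tenant
    k = max(1, len(group) - 1)
    return group['user_id'].apply(
        lambda _: group['user_id'].sample(n=np.random.randint(1, k + 1)).tolist())

df['new_column'] = df.groupby('tenant_id', group_keys=False).apply(pick_ids)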