Pandas DataFrame, turn index and its name into column

I wanted to create a DataFrame with two columns, one called 'id' and one called 'SalePrice':
submission = pd.DataFrame({'SalePrice':pre})
It looks like this:
SalePrice
0 183242.025920
1 188796.451732
2 187878.763989
3 179789.672031
I know that I can name the index, but instead I need to name it as a normal column, on the same level as SalePrice. Does anyone know how to do that?

Try creating it with the DataFrame constructor (this assumes numpy is imported as np):
submission = pd.DataFrame({'SalePrice': pre, 'id': np.arange(len(pre))})
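A runnable version of that answer (a sketch; the pre values below are hypothetical stand-ins for the asker's predictions):

import numpy as np
import pandas as pd

pre = [183242.03, 188796.45, 187878.76, 179789.67]  # hypothetical stand-in values

# building both columns up front avoids any index manipulation later
submission = pd.DataFrame({'id': np.arange(len(pre)), 'SalePrice': pre})
print(submission)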

Just use reset_index, as @Andy L. suggested. Here's the full code:
submission = pd.DataFrame({'SalePrice':[1,2,3,4]}).reset_index()
submission.rename(columns = {'index':'id'}, inplace=True)
print(submission)
The output:
id SalePrice
0 0 1
1 1 2
2 2 3
3 3 4
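If the DataFrame already exists, another option is to name the index first and then promote it (a minimal sketch; rename_axis names the index so reset_index turns it into a ready-named column):

import pandas as pd

submission = pd.DataFrame({'SalePrice': [1, 2, 3, 4]})

# name the index 'id', then reset_index makes it a regular column
submission = submission.rename_axis('id').reset_index()
print(submission)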

Related

Create a new column based on another column in a dataframe

I have a df with multiple columns. One of my columns is extra_type. Now I want to create a new column based on the values of the extra_type column. For example:
extra_type
NaN
legbyes
wides
byes
Now I want to create a new column with 1 and 0: if extra_type is not equal to 'wides' then 1, else 0.
I tried it like this:
df1['ball_faced'] = df1[df1['extra_type'].apply(lambda x: 1 if [df1['extra_type']!= 'wides'] else 0)]
It's not working this way. Any help on how to make this work is appreciated.
The expected output is like below:
extra_type ball_faced
NaN 1
legbyes 1
wides 0
byes 1
Note that there's no need to use apply() or a lambda as in the original question, since comparison of a pandas Series and a string value can be done in a vectorized manner as follows:
df1['ball_faced'] = df1.extra_type.ne('wides').astype(int)
Output:
extra_type ball_faced
0 NaN 1
1 legbyes 1
2 wides 0
3 byes 1
Here are links to docs for ne() and astype().
For some useful insights on when to use apply (and when not to), see this SO question and its answers. TL;DR from the accepted answer: "If you're not sure whether you should be using apply, you probably shouldn't."
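A self-contained reproduction of the vectorized approach (a sketch built from the sample values; note that NaN != 'wides' evaluates to True, which is why the NaN row correctly gets 1):

import numpy as np
import pandas as pd

df1 = pd.DataFrame({'extra_type': [np.nan, 'legbyes', 'wides', 'byes']})

# ne() compares the whole Series at once; no apply/lambda needed
df1['ball_faced'] = df1.extra_type.ne('wides').astype(int)
print(df1)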
For comparison, the equivalent apply() version would be:
df['ball_faced'] = df.extra_type.apply(lambda x: x != 'wides').astype(int)
Output:
  extra_type  ball_faced
0        NaN           1
1    legbyes           1
2      wides           0
3       byes           1

pandas remove words beginning with specific char [duplicate]

I am transposing a data frame where I do not have defined column names, and then need to drop rows from the transposed table where a given row's value in the first column (index 0) starts with 'zrx'. I am thinking something like this should work, but can't seem to get it working:
df[~df[0].str.startswith("zrx")]
Input data looks like this (no headers):
Index 0   Index 1
zrx456    True
zrx567    False
abc234    True
Gfh123    False
nbv345    True
zrx456    False
zrx668    True
zrx789    True
My goal is to return this data frame with the rows whose column 0 value starts with zrx removed.
If you know the name of the first column, use
df[~df.Artist.str.startswith('zrx')]
If you do not know the name of the first column, use
df[~df.iloc[:,0].str.startswith('zrx')]
Input
Artist Album Point
0 zrxAC1 A 1
1 AC2 B 2
2 zrxAC1 NaN 3
3 AC4 A 4
4 AC5 C 5
Output
Artist Album Point
1 AC2 B 2
3 AC4 A 4
4 AC5 C 5
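A self-contained sketch of the header-less case from the question (assuming every value in column 0 is a string, so the .str methods apply cleanly):

import pandas as pd

# integer column labels, as in a transposed frame with no headers
df = pd.DataFrame({0: ['zrx456', 'zrx567', 'abc234', 'Gfh123',
                       'nbv345', 'zrx456', 'zrx668', 'zrx789'],
                   1: [True, False, True, False, True, False, True, True]})

# keep only the rows whose first-column value does NOT start with 'zrx'
print(df[~df.iloc[:, 0].str.startswith('zrx')])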

counting unique values in column using sub-id

I have a df containing sub-trajectories (segments) of users, with mode of travel indicated by 0,1,2... which looks like this:
df = pd.read_csv('sample.csv')
df
id lat lon mode
0 5138001 41.144540 -8.562926 0
1 5138001 41.144538 -8.562917 0
2 5138001 41.143689 -8.563012 0
3 5138003 43.131562 -8.601273 1
4 5138003 43.132107 -8.598124 1
5 5145001 37.092095 -8.205070 0
6 5145001 37.092180 -8.204872 0
7 5145015 39.289341 -8.023454 2
8 5145015 39.197432 -8.532761 2
9 5145015 39.198361 -8.375641 2
In the above sample, id identifies the segments, but a full trajectory may be covered by different modes (i.e. contain multiple segments).
So the first 4 digits of id identify the unique trajectory, and the last 3 digits the unique segment within that trajectory.
I know that I can count the number of unique segments in the df using:
df.groupby('id')['mode'].nunique()
How do I then count the number of unique trajectories 5138, 5145, ...?
Use string indexing to get the first 4 characters with str; if necessary, first convert the values to strings with Series.astype:
df = df.groupby(df['id'].astype(str).str[:4])['mode'].nunique().reset_index(name='count')
print (df)
id count
0 5138 2
1 5145 2
If you need to process the values after the first 4 digits of id:
s = df['id'].astype(str)
df = s.str[4:].groupby(s.str[:4]).nunique().reset_index(name='count')
print (df)
id count
0 5138 2
1 5145 2
Another idea is to use a lambda function:
df.groupby(df['id'].apply(lambda x: str(x)[:4]))['mode'].nunique()
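If id is numeric, integer division avoids the string conversion entirely (a sketch assuming every id has exactly 7 digits, as in the sample):

import pandas as pd

df = pd.DataFrame({'id': [5138001, 5138001, 5138001, 5138003, 5138003,
                          5145001, 5145001, 5145015, 5145015, 5145015],
                   'mode': [0, 0, 0, 1, 1, 0, 0, 2, 2, 2]})

# 5138001 // 1000 -> 5138, i.e. the trajectory part of a 7-digit id
out = df.groupby(df['id'] // 1000)['mode'].nunique().reset_index(name='count')
print(out)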

Calculating the difference between values based on their date

I have a dataframe that looks like this, where "Date" is set as the index:
A B C D E
Date
1999-01-01 1 2 3 4 5
1999-01-02 1 2 3 4 5
1999-01-03 1 2 3 4 5
1999-01-04 1 2 3 4 5
I'm trying to compare the percent difference between two pairs of dates. I think I can do the first bit:
start_1 = "1999-01-02"
end_1 = "1999-01-03"
start_2 = "1999-01-03"
end_2 = "1999-01-04"
Obs_1 = df.loc[end_1] / df.loc[start_1] -1
Obs_2 = df.loc[end_2] / df.loc[start_2] -1
The output I get from, e.g., Obs_1 looks like this:
A 0.011197
B 0.007933
C 0.012850
D 0.016678
E 0.007330
dtype: float64
I'm looking to build some correlations between Obs_1 and Obs_2. I think I need to create a new dataframe with the labels A-E as one column (or as the index), and then the data series from Obs_1 and Obs_2 as adjacent columns.
But I'm struggling! I can't 'see' what Obs_1 and Obs_2 'are' - have I created a list? A series? How can I tell? What would be the best way of combining the two into a single dataframe...say df_1.
I'm sure the answer is staring me in the face but I'm going mental trying to figure it out...and because I'm not quite sure what Obs_1 and Obs_2 'are', it's hard to search the SO archive to help me.
Thanks in advance
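For what it's worth, dividing two rows selected with df.loc returns a pandas Series indexed by the column labels A-E, so Obs_1 and Obs_2 are Series (type(Obs_1) confirms this). A minimal sketch for combining them into one DataFrame, with Obs_2's values made up for illustration:

import pandas as pd

Obs_1 = pd.Series({'A': 0.011197, 'B': 0.007933, 'C': 0.012850,
                   'D': 0.016678, 'E': 0.007330})
Obs_2 = pd.Series({'A': 0.009, 'B': 0.011, 'C': 0.008,
                   'D': 0.014, 'E': 0.006})  # hypothetical values

# concat aligns on the shared A-E index; keys become the column names
df_1 = pd.concat([Obs_1, Obs_2], axis=1, keys=['Obs_1', 'Obs_2'])
print(df_1)
print(df_1['Obs_1'].corr(df_1['Obs_2']))  # correlation between the two columns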

collapse pandas dataframe rows based on index column

I have a dataframe that contains information that is linked by an ID column. The rows are sequential, with the odd rows containing a "start" point and the even rows containing an "end" point. My goal is to collapse the data from these into a single row, with the columns for "start" and "end" following each other. The rows do have a "packet ID" that would link them if the sequential nature of the dataframe is not consistent.
example:
df:
0 1 2 3 4 5
0 hs6 106956570 106956648 ID_A1 60 -
1 hs1 153649721 153649769 ID_A1 60 -
2 hs1 865130744 865130819 ID_A2 0 -
3 hs7 21882206 21882237 ID_A2 0 -
4 hs1 74230744 74230819 ID_A3 0 +
5 hs8 92041314 92041508 ID_A3 0 +
The resulting dataframe that I am trying to achieve is:
new_df
0 1 2 3 4 5
0 hs6 106956570 106956648 hs1 153649721 153649769
1 hs1 865130744 865130819 hs7 21882206 21882237
2 hs1 74230744 74230819 hs8 92041314 92041508
with each row containing the information on both the start and the end-point.
I have tried to pass the IDs into an array and use a for loop to pull the information out of the original dataframe into a new dataframe, but this has not worked. I was looking at the melt documentation, which suggests that pd.melt(df, id_vars=[3], value_vars=[0,1,2]) may work, but I cannot see how to get the corresponding row into positions new_df[3,4,5].
I think that it may be something really simple that I am missing but any suggestions would be appreciated.
You can try this:
df_out = df.set_index([df.index%2, df.index//2])[df.columns[:3]]\
.unstack(0).sort_index(level=1, axis=1)
df_out.columns = np.arange(len(df_out.columns))
df_out
Output:
0 1 2 3 4 5
0 hs6 106956570 106956648 hs1 153649721 153649769
1 hs1 865130744 865130819 hs7 21882206 21882237
2 hs1 74230744 74230819 hs8 92041314 92041508
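Since the start and end rows strictly alternate, a NumPy reshape is an alternative (a sketch assuming every start row is immediately followed by its matching end row):

import pandas as pd

df = pd.DataFrame([['hs6', 106956570, 106956648, 'ID_A1', 60, '-'],
                   ['hs1', 153649721, 153649769, 'ID_A1', 60, '-'],
                   ['hs1', 865130744, 865130819, 'ID_A2', 0, '-'],
                   ['hs7', 21882206, 21882237, 'ID_A2', 0, '-'],
                   ['hs1', 74230744, 74230819, 'ID_A3', 0, '+'],
                   ['hs8', 92041314, 92041508, 'ID_A3', 0, '+']])

# fold each pair of rows' first three columns into one row of six values
new_df = pd.DataFrame(df.iloc[:, :3].values.reshape(-1, 6))
print(new_df)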