I have a pandas DataFrame "lty" of shape (366, 1) that looks like the first table below. The first "time" index level is the month (Jan in this case) and the second "time" level is the day of the month. It is the result of
lty = ylt.groupby([ylt.index.month, ylt.index.day]).mean()
I need to reset the index, or get the columns renamed, so that "lty" goes from shape (366, 1) to (366, 3).

Current:

           LT Mean
time time
1    1     11.263604
     2     11.971495
     3     11.989080
     4     12.558736
     5     11.850899

Desired:

month  day    LT Mean
1      1    11.263604
1      2    11.971495
1      3    11.989080
1      4    12.558736
1      5    11.850899
I have tried to reset the index and I get this error:
lty.reset_index()
ValueError: cannot insert time, already exists
Thank you; I am still learning how to manipulate columns and indexing.
When you group by, rename the grouping Index arrays so that the names don't collide after the aggregation:
lty = (ylt.groupby([ylt.index.month.rename('month'), ylt.index.day.rename('day')])
.mean().reset_index())
#month day LT Mean
#1 1 11.263604
#1 2 11.971495
#1 3 11.989080
#1 4 12.558736
#1 5 11.850899
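A runnable sketch of the fix, using a hypothetical daily series standing in for "ylt" (the name and values are made up for illustration):

```python
import pandas as pd
import numpy as np

# Hypothetical daily series spanning a leap year, standing in for "ylt"
idx = pd.date_range('2020-01-01', '2020-12-31', freq='D', name='time')
ylt = pd.Series(np.arange(len(idx), dtype=float), index=idx, name='LT Mean')

# Renaming the grouping arrays avoids the "cannot insert time" collision,
# because the two levels no longer share the name 'time'
lty = (ylt.groupby([ylt.index.month.rename('month'),
                    ylt.index.day.rename('day')])
          .mean()
          .reset_index())

print(lty.shape)              # (366, 3)
print(lty.columns.tolist())   # ['month', 'day', 'LT Mean']
```

Without the rename, both MultiIndex levels are called 'time', and reset_index cannot insert two columns with the same name.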
I have an issue when trying to work with pandas indexing. It first happened on a larger dataset, and I was able to recreate it in this dummy dataframe. Apologies if my table formatting is terrible; I don't know how to make it better visually.
Unnamed: 0 col1 col2 col3
0 Name Sun Mon Tue
1 one 1 2 1
2 two 4 4 3
3 three 2 1 1
4 four 1 5 5
5 five 1 5 5
6 six 5 1 1
7 seven 5 5 6
8 eight 5 3 4
9 nine 5 3 3
So what I am trying to do is rename the first column label ('Unnamed: 0') to something meaningful, but when I finally try to reset_index, the index "column" gets the name "test" for some reason, while the first actual column gets the label "index".
df.rename({df.columns[0]: 'test'}, axis=1, inplace=True)
df.set_index('test', inplace=True)
dft = df.transpose()
dft
test Name one two three four five six seven eight nine
col1 Sun 1 4 2 1 1 5 5 5 5
col2 Mon 2 4 1 5 5 1 5 3 3
col3 Tue 1 3 1 5 5 1 6 4 3
First of all, if my understanding is correct, the index is not even an actual column in the dataframe, so why does it get to have a label when resetting the index?
And more importantly, why are the labels "test" and "index" swapped?
dft.reset_index(inplace=True)
dft
test index Name one two three four five six seven eight nine
0 col1 Sun 1 4 2 1 1 5 5 5 5
1 col2 Mon 2 4 1 5 5 1 5 3 3
2 col3 Tue 1 3 1 5 5 1 6 4 3
I have tried every combination of set_index / reset_index I can think of, including drop=True and inplace=True, but I cannot find a way to create a proper index like the one I started with.
Yes, the axes (the index axis and the column axis) can have names.
This is useful for multi-indexing.
When you call .reset_index(), the index is extracted into a new column, which is named after the index name (by default, 'index').
If you want, you can rename the axis and reset the index in one line:
df.rename_axis('Name').reset_index()
Why is 'test' printed not where I expect?
After your code, if you print(dft.columns), you will see:
Index(['index', 'Name', 'one', 'two', 'three', 'four', 'five', 'six', 'seven', 'eight', 'nine'],
dtype='object',
name='test')
There are 11 columns. The column axis is called 'test' (see name='test' in the output above).
Also: print(dft.columns.name) prints test.
So what you actually see when you print your dataframe are the column names, to the left of which is the name of the column axis: 'test'.
It is NOT how the index axis is named. You can check: print(type(dft.index.name)) prints <class 'NoneType'>.
Now, why is the column axis named 'test'?
Let's see how it got there step by step.
df.rename({df.columns[0]: 'test'}, axis=1, inplace=True)
First column is now named 'test'.
df.set_index('test', inplace=True)
First column has moved from being a column to being an index. The index name is 'test'. The old index disappeared.
dft = df.transpose()
The column axis is now named 'test'. The index is now named however the column axis was named before transposing.
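The whole chain can be reproduced on a tiny two-row version of the data (the values here are made up; only the column labels match the question):

```python
import pandas as pd

# Minimal stand-in for the question's dataframe
df = pd.DataFrame({'Unnamed: 0': ['Name', 'one'],
                   'col1': ['Sun', 1],
                   'col2': ['Mon', 2]})

df = df.rename(columns={df.columns[0]: 'test'}).set_index('test')
dft = df.transpose()

# The old index name migrated to the column axis during the transpose
print(dft.columns.name)   # 'test'
print(dft.index.name)     # None

# Clearing the column-axis name before reset_index gives a plain 'index'
# column with no stray 'test' label in the header row
clean = dft.rename_axis(columns=None).reset_index()
```

So the "reversed" labels are really the column-axis name ('test') printed to the left of an ordinary column called 'index'; clearing the axis name removes the confusion.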
I have the following dataframe:
d = {'value': [1,1,1,1,1,1,1,1,1,1], 'flag_1': [0,1,0,1,1,1,0,1,1,1],'flag_2':[1,0,1,1,1,1,1,0,1,1],'index':[1,2,3,4,5,6,7,8,9,10]}
df = pd.DataFrame(data=d)
I need to perform the following filter on it:
If flag_1 and flag_2 are equal, keep only the row with the maximum index from each run of consecutive indices. Below, for rows 4, 5, 6 and for rows 9, 10, flag_1 and flag_2 are equal. From the group of consecutive indices 4, 5, 6 I therefore wish to keep only row 6 and drop rows 4 and 5. From the next group, rows 9 and 10, I wish to keep only row 10. The rows where flag_1 and flag_2 are not equal should all be retained. I want my final output to look as shown below:
I am really not sure how to achieve what is required so I would be grateful for any advice on how to do it.
IIUC, you can compare consecutive rows with shift. This solution requires a sorted index.
In [5]: df[~df[['flag_1', 'flag_2']].eq(df[['flag_1', 'flag_2']].shift(-1)).all(axis=1)]
Out[5]:
value flag_1 flag_2 index
0 1 0 1 1
1 1 1 0 2
2 1 0 1 3
5 1 1 1 6
6 1 0 1 7
7 1 1 0 8
9 1 1 1 10
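Spelled out as a runnable sketch with the question's data: a row is kept unless the next row carries identical flags, which leaves only the last row of each run.

```python
import pandas as pd

d = {'value': [1] * 10,
     'flag_1': [0, 1, 0, 1, 1, 1, 0, 1, 1, 1],
     'flag_2': [1, 0, 1, 1, 1, 1, 1, 0, 1, 1],
     'index':  list(range(1, 11))}
df = pd.DataFrame(data=d)

# Drop a row when the following row has the same (flag_1, flag_2) pair;
# shift(-1) brings the next row's flags up for comparison
flags = df[['flag_1', 'flag_2']]
out = df[~flags.eq(flags.shift(-1)).all(axis=1)]
print(out['index'].tolist())   # [1, 2, 3, 6, 7, 8, 10]
```

The last row always survives, because shift(-1) yields NaN there and the equality check comes back False.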
My problem is similar to the "split a dataframe into chunks of N rows" problem, except that the number of rows in each chunk will be different. I have a dataframe as such:
A  B  C
1  2  0
1  2  1
1  2  2
1  2  0
1  2  1
1  2  2
1  2  3
1  2  4
1  2  0
A and B are just whatever, don't pay attention to them. Column C, though, starts at 0 and increments with each row until it suddenly resets to 0. So in the dataframe above, the first 3 rows form a new dataframe, then the next 5 are a second new dataframe, and this continues as my dataframe adds more and more rows.
To finish off the question,
df = [x for _, x in df.groupby(df['C'].eq(0).cumsum())]
allows me to group all the subgroups and then with this groupby I can select each subgroups as a separate dataframe.
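A self-contained demo of that one-liner on the question's data: each time C returns to 0, the cumulative sum of the `C == 0` mask ticks up by one, giving every chunk its own group label.

```python
import pandas as pd

df = pd.DataFrame({'A': [1] * 9,
                   'B': [2] * 9,
                   'C': [0, 1, 2, 0, 1, 2, 3, 4, 0]})

# cumsum over the reset mask labels the chunks 1, 1, 1, 2, 2, 2, 2, 2, 3
chunks = [g for _, g in df.groupby(df['C'].eq(0).cumsum())]
print([len(g) for g in chunks])   # [3, 5, 1]
```

Each element of `chunks` is an independent dataframe that still carries its original row labels.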
I have the following dataframe df, which specifies latitudes and longitudes for a certain groupnumber:
latitude longitude group
0 51.822231 4.700267 1
1 51.822617 4.801417 1
2 51.823235 4.903300 1
3 51.823433 5.003917 1
4 51.823616 5.504467 1
5 51.822231 3.900267 2
6 51.822617 3.901417 2
7 51.823235 3.903300 2
8 51.823433 6.903917 2
9 51.823616 8.904467 2
10 51.822231 1.900267 3
11 51.822617 2.901417 3
12 51.823235 11.903300 3
13 51.823433 12.903917 3
14 51.823616 13.904467 3
Within each group number I try to find the lower and upper neighbours in the 'longitude' column for a specified value longitude_value = 5.00. Within each group ('trip') the longitudes in df are already sorted (they ascend in each group).
Per group I want the upper and lower neighbour values of longitude = 5.000000. The desired output looks like:
latitude longitude trip
2 51.823235 4.903300 1
3 51.823433 5.003917 1
7 51.823235 3.903300 2
8 51.823433 6.903917 2
11 51.822617 2.901417 3
12 51.823235 11.903300 3
From this result I want to rearrange the data a little bit as:
lat_lo lat_up lon_lo lon_up
0 51.823235 51.823433 4.903300 5.003917
1 51.823235 51.823433 3.903300 6.903917
2 51.822617 51.823235 2.901417 11.903300
Hope I got your question right. See my attempt below. I made it long to be explicit about my approach. I could easily have introduced a longitude value of 5.00 and sliced on the index, but that would have complicated answering part 2 of your question. If I missed something, let me know.
Data
df=pd.read_clipboard()
df
Input point and calculate difference with longitude
fn=5.00
df['dif']=(df['longitude']-fn)
df
Find the minimum positive difference in each group
df1=df[df['dif'] > 0].groupby('group').min().reset_index()
Find the minimum negative difference in each group
df2=df[df['dif'] < 0].groupby('group').max().reset_index()
Append the second group above to the first into one df. This answers your question 1
df3=pd.concat([df1, df2], ignore_index=True).sort_values(['group','longitude'])
df3
Question 2
Introduce a column called Status with an alternating pattern: 3 for the lower neighbour and 4 for the upper neighbour
# after the sort above, rows alternate lower/upper within each group
df3['Status'] = np.tile([3, 4], len(df3) // 2)
df3.drop(columns=['dif'], inplace=True)
df3
Rename the neighbours to lon_lo and lon_up
df3['Status'] = df3['Status'].map({3: 'lon_lo', 4: 'lon_up'})
Using pivot, break the dataframe up into lon_lo plus latitude, and do the same for lon_up. The rationale here is to break the latitudes into two groups, lo and up
first group break
df4=df3[df3['Status']=='lon_lo']
result=df4.pivot_table('longitude',['latitude','group'],'Status').reset_index().set_index('group')
second group break
df4=df3[df3['Status']=='lon_up']
result1=df4.pivot_table('longitude',['latitude','group'],'Status').reset_index().set_index('group')
Merge the two groups on the index, with the lon_lo frame on the left so the suffixes label the latitudes correctly
final=result.merge(result1, left_index=True, right_index=True, suffixes=('_lo','_up'))
final
Output
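Since the longitudes are already sorted within each group, a much shorter route (a sketch of an alternative, not the answer's exact method) is to take the last row below the target and the first row above it per group:

```python
import pandas as pd

df = pd.DataFrame({
    'latitude':  [51.822231, 51.822617, 51.823235, 51.823433, 51.823616] * 3,
    'longitude': [4.700267, 4.801417, 4.903300, 5.003917, 5.504467,
                  3.900267, 3.901417, 3.903300, 6.903917, 8.904467,
                  1.900267, 2.901417, 11.903300, 12.903917, 13.904467],
    'group': [1] * 5 + [2] * 5 + [3] * 5})

fn = 5.00
dif = df['longitude'] - fn

# Longitudes ascend within each group, so the last row below the target
# is the lower neighbour and the first row above it is the upper one
lo = df[dif < 0].groupby('group').last()
up = df[dif > 0].groupby('group').first()

final = lo.join(up, lsuffix='_lo', rsuffix='_up')
print(final[['latitude_lo', 'latitude_up', 'longitude_lo', 'longitude_up']])
```

This relies on the sortedness the question guarantees; unsorted groups would need a sort_values(['group', 'longitude']) first.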
I have a frame, df:
Date A B C
x 1 1 1
y 1 1 1
z 1 1 1
The "Date" column is my index, and all timestamps are random times down to the second level. I want to remove all rows in the dataframe, except for the row that is the closest to the start of a new hour.
For example, if 12/15/16 15:16:12 is the earliest row in hour 15 of that date, I want every other row with a time stamp greater than that stamp to be deleted. I then want the process repeated for the next hour, and so on.
Is this possible in a fast manner in pandas?
Thanks
You can use groupby and head(1) after sort_index:
df = df.sort_index()
df.groupby(df.index.strftime('%Y-%m-%d %H')).head(1)
Out[83]:
A
Date
2016-12-15 15:16:12 1
Data input
df
Out[84]:
A
Date
2016-12-15 15:16:12 1
2016-12-15 15:19:12 1
2016-12-15 15:56:12 1
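A self-contained version with an extra hypothetical 16:02 row added, to show the process repeating for the next hour (the timestamps beyond those in the answer are invented for illustration):

```python
import pandas as pd

# Second-level timestamps, deliberately unsorted within hour 15
idx = pd.to_datetime(['2016-12-15 15:19:12',
                      '2016-12-15 15:16:12',
                      '2016-12-15 15:56:12',
                      '2016-12-15 16:02:05'])
df = pd.DataFrame({'A': 1}, index=idx).rename_axis('Date')

# Sort first, then build the hour key from the sorted frame so the
# grouping array stays aligned with the rows
df = df.sort_index()
out = df.groupby(df.index.strftime('%Y-%m-%d %H')).head(1)
print(out.index.tolist())   # earliest row in each hour
```

head(1) on each hourly group keeps the first (earliest, after sorting) row and discards the rest, which is exactly the "closest to the start of a new hour" rule.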