Dataframe to multiIndex for sktime format - pandas

I have multivariate time series data in this format (a pd.DataFrame indexed on Time). I am trying to use sktime, which requires the data to be in a MultiIndex format. If I want to use a rolling window of 3 on the above data, it needs to end up in that format, where the pd.DataFrame has a MultiIndex on (instance, time).
Is it possible to transform the data into this new format?

Edit: here's a more straightforward and probably faster solution using row indexing:
import pandas as pd

df = pd.DataFrame({
    'time': range(5),
    'a': [f'a{i}' for i in range(5)],
    'b': [f'b{i}' for i in range(5)],
})
w = 3
w_starts = range(0, len(df) - (w - 1))  # start positions of each window
# iterate through the overlapping windows to create the 'instance' col and concat
roll_df = pd.concat(
    df[s:s+w].assign(instance=i) for (i, s) in enumerate(w_starts)
).set_index(['instance', 'time'])
print(roll_df)
Output
               a   b
instance time
0        0    a0  b0
         1    a1  b1
         2    a2  b2
1        1    a1  b1
         2    a2  b2
         3    a3  b3
2        2    a2  b2
         3    a3  b3
         4    a4  b4

Here's one way to achieve the desired result (assuming the source frame df has columns Time, A, and B):
import numpy as np
import pandas as pd

# Create the instance column (one id per window of length 3)
instance = np.repeat(range(len(df) - 2), 3)
# Repeat the Time column for each rolling window
time = np.concatenate([df.Time[i:i+3].values for i in range(len(df) - 2)])
# Repeat the A column for each rolling window
a = np.concatenate([df.A[i:i+3].values for i in range(len(df) - 2)])
# Repeat the B column for each rolling window
b = np.concatenate([df.B[i:i+3].values for i in range(len(df) - 2)])
# Create a new DataFrame with the desired format
new_df = pd.DataFrame({'Instance': instance, 'Time': time, 'A': a, 'B': b})
# Set the MultiIndex on the new DataFrame
new_df.set_index(['Instance', 'Time'], inplace=True)
new_df
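A fully vectorized variant is also possible. Here is a minimal sketch (assuming the same Time/A/B columns and window length 3 as above) that builds all window row positions at once instead of looping:
import numpy as np
import pandas as pd

df = pd.DataFrame({'Time': range(5),
                   'A': [f'a{i}' for i in range(5)],
                   'B': [f'b{i}' for i in range(5)]})
w = 3
n_windows = len(df) - w + 1
# Row positions of every window: shape (n_windows, w)
pos = np.arange(n_windows)[:, None] + np.arange(w)
rolled = df.iloc[pos.ravel()].copy()
rolled['Instance'] = np.repeat(np.arange(n_windows), w)
rolled = rolled.set_index(['Instance', 'Time'])
print(rolled)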

Related

Appending GeoDataFrames does not return expected dataframe

I have the following issue when trying to append dataframes containing geometry types. The pandas dataframe I am looking at looks like this:
name x_zone y_zone
0 A1 65.422080 48.147850
1 A1 46.635708 51.165745
2 A1 46.597984 47.657444
3 A1 68.477700 44.073700
4 A3 46.635708 54.108190
5 A3 46.635708 51.844770
6 A3 63.309560 48.826878
7 A3 62.215572 54.108190
As you can see, there are four rows per name, as these represent the corners of polygons. I need this in the form of a polygon as defined in geopandas, i.e. I need a GeoDataFrame. To do so, I use the following code for just one of the names (just to check it works):
df = df[df['name'] == 'A1']
x = df['x_zone'].to_list()
y = df['y_zone'].to_list()
polygon_geom = Polygon(zip(x, y))
crs = {'init': "EPSG:4326"}
polygon = gpd.GeoDataFrame(index=['A1'], crs=crs, geometry=[polygon_geom])
print(polygon)
which returns:
geometry
A1 POLYGON ((65.42208 48.14785, 46.63571 51.16575...
polygon.info()
<class 'geopandas.geodataframe.GeoDataFrame'>
Index: 1 entries, A1 to A1
Data columns (total 1 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 geometry 1 non-null geometry
dtypes: geometry(1)
memory usage: 16.0+ bytes
So far, so good. So, for more names I thought the following would work:
unique_place = list(df['name'].unique())
GE = []
for name in unique_place:
    f = df[df['name'] == name]
    x = f['x_zone'].to_list()
    y = f['y_zone'].to_list()
    polygon_geom = Polygon(zip(x, y))
    crs = {'init': "EPSG:4326"}
    polygon = gpd.GeoDataFrame(index=[name], crs=crs, geometry=[polygon_geom])
    print(polygon.info())
    GE.append(polygon)
But it returns a list, not a dataframe.
[ geometry
A1 POLYGON ((65.42208 48.14785, 46.63571 51.16575...,
geometry
A3 POLYGON ((46.63571 54.10819, 46.63571 51.84477...]
This is strange, because .append() works very well if what is to be appended is a pandas dataframe.
What am I missing? Also, even in the first case, I am left with only the geometry column, but that is not an issue, because I can write the file to a shp and read it again to have a second column (name).
Grateful for any solution that'll get me going!
I guess you need example code using groupby on your data. Let me know if that is not the case.
from io import StringIO

import geopandas as gpd
import pandas as pd
from shapely.geometry import Polygon
dats_str = """index id x_zone y_zone
0 A1 65.422080 48.147850
1 A1 46.635708 51.165745
2 A1 46.597984 47.657444
3 A1 68.477700 44.073700
4 A3 46.635708 54.108190
5 A3 46.635708 51.844770
6 A3 63.309560 48.826878
7 A3 62.215572 54.108190"""
# read the string, convert to a dataframe
df1 = pd.read_csv(StringIO(dats_str), sep=r'\s+', index_col='index')
# Use groupby as an iterator to:
# - collect the items of interest
# - process some data: mean, create Polygon, maybe others
# - all are collected/appended as lists
ids = []
counts = []
meanx = []
meany = []
list_x = []
list_y = []
polygon = []
for label, group in df1.groupby('id'):
    # label: 'A1', 'A3'
    # group: the sub-dataframe for 'A1', for 'A3'
    ids.append(label)
    counts.append(len(group))  # number of rows
    meanx.append(group.x_zone.mean())
    meany.append(group.y_zone.mean())
    # process the x,y data of this group -> for the polygon
    xs = group.x_zone.values
    ys = group.y_zone.values
    list_x.append(xs)
    list_y.append(ys)
    polygon.append(Polygon(zip(xs, ys)))  # make/collect the polygon

# the items above are used to create a dataframe here
df_from_groupby = pd.DataFrame({'id': ids, 'counts': counts,
                                'meanx': meanx, 'meany': meany,
                                'list_x': list_x, 'list_y': list_y,
                                'polygon': polygon})
If you print the dataframe df_from_groupby, you will get:
id counts meanx meany \
0 A1 4 56.783368 47.761185
1 A3 4 54.699137 52.222007
list_x \
0 [65.42208, 46.635708, 46.597984, 68.4777]
1 [46.635708, 46.635708, 63.30956, 62.215572]
list_y \
0 [48.14785, 51.165745, 47.657444, 44.0737]
1 [54.10819, 51.84477, 48.826878, 54.10819]
polygon
0 POLYGON ((65.42207999999999 48.14785, 46.63570...
1 POLYGON ((46.635708 54.10819, 46.635708 51.844...
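As a hedged follow-up (using df_from_groupby as built above): the polygon column holds plain shapely objects, so to obtain a true GeoDataFrame you can promote that column to the geometry. The list GE from the question can likewise be combined with pd.concat, since list.append only collects the one-row frames in a Python list.
gdf = gpd.GeoDataFrame(df_from_groupby, geometry='polygon', crs='EPSG:4326')
# and, for the question's list of one-row GeoDataFrames:
# combined = pd.concat(GE)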

How do I use df.add_suffix to add suffixes to duplicate column names in Pandas?

I have a large dataframe with 400 columns. 200 of the column names are duplicates of the first 200. How can I use df.add_suffix to add a suffix only to the duplicate column names?
Or is there a better way to do it automatically?
Here is my solution, starting with:
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(4).reshape(1, -1), columns=['a', 'b', 'a', 'b'])
Output
   a  b  a  b
0  0  1  2  3
Then I use a lambda function to append '_' to each duplicated name:
df.columns = df.columns + np.vectorize(lambda x: '_' if x else '')(df.columns.duplicated())
Output
   a  b  a_  b_
0  0  1   2   3
If you have more than one duplicate, you can loop until none are left, as sketched below. This works for duplicated indices too, and it also keeps the index name.
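A minimal sketch of that loop (assuming df and numpy as above):
while df.columns.duplicated().any():
    # keep appending '_' to any still-duplicated name until all names are unique
    df.columns = df.columns + np.where(df.columns.duplicated(), '_', '')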
If I understand your question correctly, you have each name twice. If so, it is possible to check for duplicated values using df.columns.duplicated(). Then you can create a new list, only modifying duplicated values by adding your self-defined suffix. This is different from the other posted solution, which modifies all entries.
df = pd.DataFrame(data=[[1, 2, 3, 4]], columns=list('aabb'))
my_suffix = 'T'
df.columns = [name if not duplicated else name + my_suffix
              for duplicated, name in zip(df.columns.duplicated(), df.columns)]
df
>>>
   a  aT  b  bT
0  1   2  3   4
My answer has the disadvantage that the dataframe can still have duplicated column names if one name is used three or more times.
You could do:
import pandas as pd
# setup dummy DataFrame with repeated columns
df = pd.DataFrame(data=[[1, 2, 3]], columns=list('aaa'))
# create unique identifier for each repeated column
identifier = df.columns.to_series().groupby(level=0).transform('cumcount')
# rename columns with the new identifiers
df.columns = df.columns.astype('string') + identifier.astype('string')
print(df)
Output
a0 a1 a2
0 1 2 3
If there is only one duplicate column, you could do:
# setup dummy DataFrame with repeated columns
df = pd.DataFrame(data=[[1, 2, 3, 4]], columns=list('aabb'))
# create unique identifier for each repeated column
identifier = df.columns.duplicated().astype(int)
# rename columns with the new identifiers
df.columns = df.columns.astype('string') + identifier.astype(str)
print(df)
Output (for only one duplicate)
a0 a1 b0 b1
0 1 2 3 4
Add a numbering suffix starting with '_1', applied from the first duplicate onward to columns that appear more than once.
E.g. a column name list [a, b, c, a, b, a] will return [a, b, c, a_1, b_1, a_2].
from collections import Counter

counter = Counter()
new_cols = []
for x in range(df.shape[1]):
    counter.update([df.columns[x]])
    if counter[df.columns[x]] == 1:
        new_cols.append(df.columns[x])
    else:
        tx = counter[df.columns[x]] - 1
        new_cols.append(df.columns[x] + '_' + str(tx))
df.columns = new_cols
df.columns
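Since the question asks about df.add_suffix specifically, note that add_suffix renames every column unconditionally. A hedged sketch for the stated layout (columns 200-399 duplicating columns 0-199; the '_dup' suffix is my choice, not from the question) is to split the frame, suffix only the duplicated half, and reassemble:
import pandas as pd

first = df.iloc[:, :200]                      # the original 200 columns
second = df.iloc[:, 200:].add_suffix('_dup')  # only the duplicates get a suffix
df = pd.concat([first, second], axis=1)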

pandas: expand a column of lists into multiple columns

I want to expand / cast a column that contains lists into multiple columns:
df = pd.DataFrame({'a':[1,2], 'b':[[11,22],[33,44]]})
# I want:
pd.DataFrame({'a':[1,2], 'b1':[11,33], 'b2':[22,44]})
Call .tolist() on the column to build the new columns as a DataFrame, then join back to the other column(s).
df = pd.concat([df.drop(columns='b'),
                pd.DataFrame(df['b'].tolist(), index=df.index).add_prefix('b')],
               axis=1)
a b0 b1
0 1 11 22
1 2 33 44
df = pd.DataFrame({'a':[1,2], 'b':[[11,22],[33,44]]})
df["b1"] = df["b"].apply(lambda cell: cell[0])
df["b2"] = df["b"].apply(lambda cell: cell[1])
df[["a", "b1", "b2"]]
You can use .tolist() on your "b" column to expand it out, then just assign it back to the dataframe and get rid of your original "b" column:
df = pd.DataFrame({'a':[1,2], 'b':[[11,22],[33,44]]})
df[["b1", "b2"]] = df["b"].tolist()
df = df.drop("b", axis=1) # alternatively: del df["b"]
print(df)
a b1 b2
0 1 11 22
1 2 33 44
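Another variant sometimes seen (typically slower on large frames) expands the list column by applying pd.Series; a minimal sketch:
import pandas as pd

df = pd.DataFrame({'a': [1, 2], 'b': [[11, 22], [33, 44]]})
expanded = df['b'].apply(pd.Series).add_prefix('b')  # columns b0, b1
df = df.drop(columns='b').join(expanded)
print(df)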

Lookup row in pandas dataframe

I have two dataframes (A & B). For each row in A I would like to look up some information that is in B. I tried:
A = pd.DataFrame({'X' : [1,2]}, index=[4,5])
B = pd.DataFrame({'Y' : [3,4,5]}, index=[4,5,6])
C = pd.DataFrame(A.index)
C.columns = ['I']
C['Y'] = B.loc[C.I, 'Y']
I wanted '3, 4' but I got 'NaN', 'NaN'.
Use A.join(B).
The result is:
X Y
4 1 3
5 2 4
Joining is by index, and the value from B for key 6 is absent, since A does not contain that key.
What you should do is make the indexes the same; pandas is index-sensitive, which means it aligns on the index when doing assignment:
C = pd.DataFrame(A.index, index=A.index)  # change here
C.columns = ['I']
C['Y'] = B.loc[C.I, 'Y']
C
Out[770]:
I Y
4 4 3
5 5 4
Or just modify your code by adding .values at the end:
C['Y'] = B.loc[C.I, 'Y'].values
Since you mentioned lookup, let us use lookup:
C['Y'] = B.lookup(C.I, ['Y'] * len(C))
#Out[779]: array([3, 4], dtype=int64)
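Note that DataFrame.lookup was deprecated in pandas 1.2 and removed in 2.0; a hedged replacement for this single-column case uses reindex:
# align B's Y values to the keys in C.I, ignoring C's own index on assignment
C['Y'] = B['Y'].reindex(C.I).to_numpy()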

dataframe slicing with loc [duplicate]

How do I select columns a and b from df, and save them into a new dataframe df1?
index a b c
1 2 3 4
2 3 4 5
Unsuccessful attempt:
df1 = df['a':'b']
df1 = df.ix[:, 'a':'b']
The column names (which are strings) cannot be sliced in the manner you tried.
Here you have a couple of options. If you know from context which variables you want to slice out, you can just return a view of only those columns by passing a list into the __getitem__ syntax (the []'s).
df1 = df[['a', 'b']]
Alternatively, if it matters to index them numerically and not by their name (say your code should automatically do this without knowing the names of the first two columns) then you can do this instead:
df1 = df.iloc[:, 0:2] # Remember that Python does not slice inclusive of the ending index.
Additionally, you should familiarize yourself with the idea of a view into a Pandas object vs. a copy of that object. The first of the above methods will return a new copy in memory of the desired sub-object (the desired slices).
Sometimes, however, Pandas indexing conventions don't do this and instead give you a new variable that just refers to the same chunk of memory as the sub-object or slice in the original object. This can happen with the second way of indexing, so you can modify it with the .copy() method to get a regular copy. When this happens, changing what you think is the sliced object can sometimes alter the original object. Always good to be on the lookout for this.
df1 = df.iloc[:, 0:2].copy()  # To avoid the case where changing df1 also changes df
To use iloc, you need to know the column positions (or indices). Since the column positions may change, instead of hard-coding indices you can build a mapping from column names to their positions using the get_loc method of the dataframe's columns attribute:
col_indices = {c: df.columns.get_loc(c) for c in df.columns}
Now you can use this dictionary to access columns by name while still indexing with iloc.
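For example, a hedged usage of that mapping to pull out columns 'a' and 'b' by position:
df1 = df.iloc[:, [col_indices['a'], col_indices['b']]]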
As of version 0.11.0, columns can be sliced in the manner you tried using the .loc indexer:
df.loc[:, 'C':'E']
is equivalent to
df[['C', 'D', 'E']] # or df.loc[:, ['C', 'D', 'E']]
and returns columns C through E.
A demo on a randomly generated DataFrame:
import pandas as pd
import numpy as np
np.random.seed(5)
df = pd.DataFrame(np.random.randint(100, size=(100, 6)),
                  columns=list('ABCDEF'),
                  index=['R{}'.format(i) for i in range(100)])
df.head()
Out:
A B C D E F
R0 99 78 61 16 73 8
R1 62 27 30 80 7 76
R2 15 53 80 27 44 77
R3 75 65 47 30 84 86
R4 18 9 41 62 1 82
To get the columns from C to E (note that unlike integer slicing, E is included in the columns):
df.loc[:, 'C':'E']
Out:
C D E
R0 61 16 73
R1 30 80 7
R2 80 27 44
R3 47 30 84
R4 41 62 1
R5 5 58 0
...
The same works for selecting rows based on labels. Get the rows R6 to R10 from those columns:
df.loc['R6':'R10', 'C':'E']
Out:
C D E
R6 51 27 31
R7 83 19 18
R8 11 67 65
R9 78 27 29
R10 7 16 94
.loc also accepts a Boolean array so you can select the columns whose corresponding entry in the array is True. For example, df.columns.isin(list('BCD')) returns array([False, True, True, True, False, False], dtype=bool) - True if the column name is in the list ['B', 'C', 'D']; False, otherwise.
df.loc[:, df.columns.isin(list('BCD'))]
Out:
B C D
R0 78 61 16
R1 27 30 80
R2 53 80 27
R3 65 47 30
R4 9 41 62
R5 78 5 58
...
Assuming your column names (df.columns) are ['index','a','b','c'], then the data you want is in the
third and fourth columns. If you don't know their names when your script runs, you can do this
newdf = df[df.columns[2:4]] # Remember, Python is zero-offset! The "third" entry is at slot two.
As EMS points out in his answer, df.ix slices columns a bit more concisely, but the .columns slicing interface might be more natural, because it uses the vanilla one-dimensional Python list indexing/slicing syntax.
Warning: 'index' is a bad name for a DataFrame column. That same label is also used for the real df.index attribute, an Index array. So your column is returned by df['index'] and the real DataFrame index is returned by df.index. An Index is a special kind of Series optimized for lookup of its elements' values. For df.index it's for looking up rows by their label. That df.columns attribute is also a pd.Index array, for looking up columns by their labels.
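A small sketch of that pitfall (the frame here is illustrative):
df = pd.DataFrame({'index': [1, 2], 'a': [2, 3], 'b': [3, 4]})
df['index']  # the column that happens to be labelled 'index'
df.index     # the DataFrame's actual index (a RangeIndex here)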
In the latest version of Pandas there is an easy way to do exactly this: column names (which are strings) can be selected by passing a list of the names you want.
columns = ['b', 'c']
df1 = pd.DataFrame(df, columns=columns)
In [39]: df
Out[39]:
index a b c
0 1 2 3 4
1 2 3 4 5
In [40]: df1 = df[['b', 'c']]
In [41]: df1
Out[41]:
b c
0 3 4
1 4 5
With Pandas:
with column names:
dataframe[['column1', 'column2']]
to select by iloc and specific columns with index numbers:
dataframe.iloc[:, [1, 2]]
with loc, column names can be used like:
dataframe.loc[:, ['column1', 'column2']]
You can use the pandas.DataFrame.filter method to either filter or reorder columns like this:
df1 = df.filter(['a', 'b'])
This is also very useful when you are chaining methods.
You could provide a list of columns to be dropped and get back the DataFrame with only the remaining columns, using the drop() function on a Pandas DataFrame.
Just saying
colsToDrop = ['a']
df.drop(colsToDrop, axis=1)
would return a DataFrame with just the columns b and c.
The drop method is documented here.
I found this method to be very useful:
# iloc[row slicing, column slicing]
surveys_df.iloc[0:3, 1:4]
More details can be found here.
Starting with 0.21.0, using .loc or [] with a list with one or more missing labels is deprecated in favor of .reindex. So, the answer to your question is:
df1 = df.reindex(columns=['b','c'])
In prior versions, using .loc[list-of-labels] would work as long as at least one of the keys was found (otherwise it would raise a KeyError). This behavior is deprecated and now shows a warning message. The recommended alternative is to use .reindex().
Read more at Indexing and Selecting Data.
You can use Pandas.
I create the DataFrame:
import pandas as pd
df = pd.DataFrame([[1, 2, 5], [5, 4, 5], [7, 7, 8], [7, 6, 9]],
                  index=['Jane', 'Peter', 'Alex', 'Ann'],
                  columns=['Test_1', 'Test_2', 'Test_3'])
The DataFrame:
Test_1 Test_2 Test_3
Jane 1 2 5
Peter 5 4 5
Alex 7 7 8
Ann 7 6 9
To select one or more columns by name:
df[['Test_1', 'Test_3']]
Test_1 Test_3
Jane 1 5
Peter 5 5
Alex 7 8
Ann 7 9
You can also use:
df.Test_2
And you get column Test_2:
Jane 2
Peter 4
Alex 7
Ann 6
You can also select columns and rows using .loc[]. This is called "slicing". Notice that I take from column Test_1 to Test_3:
df.loc[:, 'Test_1':'Test_3']
The "Slice" is:
Test_1 Test_2 Test_3
Jane 1 2 5
Peter 5 4 5
Alex 7 7 8
Ann 7 6 9
And if you just want Peter and Ann from columns Test_1 and Test_3:
df.loc[['Peter', 'Ann'], ['Test_1', 'Test_3']]
You get:
Test_1 Test_3
Peter 5 5
Ann 7 9
If you want to get one element by row index and column name, you can do it just like df['b'][0]. It is as simple as you can imagine.
Or you can use df.ix[0,'b'] - mixed usage of index and label.
Note: Since v0.20, ix has been deprecated in favour of loc / iloc.
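Hedged modern equivalents of that ix lookup:
df.loc[0, 'b']                       # row label 0, column label 'b'
df.iloc[0, df.columns.get_loc('b')]  # purely positional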
df[['a', 'b']]            # Select all rows of columns 'a' and 'b'
df.loc[0:10, ['a', 'b']]  # Rows with labels 0 to 10 (inclusive), columns 'a' and 'b'
df.loc[0:10, 'a':'b']     # Rows with labels 0 to 10, columns 'a' through 'b'
df.iloc[0:10, 3:5]        # Rows 0 to 9, columns 3 and 4
df.iloc[3, 3:5]           # Row 3, columns 3 and 4
Try to use pandas.DataFrame.get (see the documentation):
import pandas as pd
import numpy as np
dates = pd.date_range('20200102', periods=6)
df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list('ABCD'))
df.get(['A', 'C'])
One different and easy approach: iterating over rows.
Using iterrows:
df1 = pd.DataFrame()  # Creating an empty dataframe
for index, i in df.iterrows():
    df1.loc[index, 'A'] = df.loc[index, 'A']
    df1.loc[index, 'B'] = df.loc[index, 'B']
df1.head()
The different approaches discussed in the previous answers are based on the assumption that either the user knows column indices to drop or subset on, or the user wishes to subset a dataframe using a range of columns (for instance between 'C' : 'E').
pandas.DataFrame.drop() is certainly an option for subsetting data based on a user-defined list of columns (though you have to be cautious: always work on a copy of the dataframe, and the inplace parameter should not be set to True!). For instance, see the minimal sketch below.
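A minimal sketch of that drop-based subsetting (assuming a frame with columns a, b, c):
df1 = df.drop(columns=['a'])  # returns a new frame with only b and c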
Another option is to use df.columns.difference(), which does a set difference on column names and returns an index-type array containing the desired columns. Following is the solution:
df = pd.DataFrame([[2,3,4], [3,4,5]], columns=['a','b','c'], index=[1,2])
columns_for_differencing = ['a']
df1 = df.copy()[df.columns.difference(columns_for_differencing)]
print(df1)
The output would be:
b c
1 3 4
2 4 5
You can also use df.pop():
>>> import numpy as np
>>> df = pd.DataFrame([('falcon', 'bird', 389.0),
...                    ('parrot', 'bird', 24.0),
...                    ('lion', 'mammal', 80.5),
...                    ('monkey', 'mammal', np.nan)],
...                   columns=('name', 'class', 'max_speed'))
>>> df
     name   class  max_speed
0  falcon    bird      389.0
1  parrot    bird       24.0
2    lion  mammal       80.5
3  monkey  mammal        NaN
>>> df.pop('class')
0 bird
1 bird
2 mammal
3 mammal
Name: class, dtype: object
>>> df
name max_speed
0 falcon 389.0
1 parrot 24.0
2 lion 80.5
3 monkey NaN
Please use df.pop(c).
I've seen several answers on that, but one remained unclear to me. How would you select those columns of interest?
The answer to that is that if you have them gathered in a list, you can just reference the columns using the list.
Example
print(extracted_features.shape)
print(extracted_features)
(63,)
['f000004' 'f000005' 'f000006' 'f000014' 'f000039' 'f000040' 'f000043'
'f000047' 'f000048' 'f000049' 'f000050' 'f000051' 'f000052' 'f000053'
'f000054' 'f000055' 'f000056' 'f000057' 'f000058' 'f000059' 'f000060'
'f000061' 'f000062' 'f000063' 'f000064' 'f000065' 'f000066' 'f000067'
'f000068' 'f000069' 'f000070' 'f000071' 'f000072' 'f000073' 'f000074'
'f000075' 'f000076' 'f000077' 'f000078' 'f000079' 'f000080' 'f000081'
'f000082' 'f000083' 'f000084' 'f000085' 'f000086' 'f000087' 'f000088'
'f000089' 'f000090' 'f000091' 'f000092' 'f000093' 'f000094' 'f000095'
'f000096' 'f000097' 'f000098' 'f000099' 'f000100' 'f000101' 'f000103']
I have the following list/NumPy array extracted_features, specifying 63 columns. The original dataset has 103 columns, and I would like to extract exactly those, then I would use
dataset[extracted_features]
And you will end up with a dataframe containing exactly those 63 columns.
This something you would use quite often in machine learning (more specifically, in feature selection). I would like to discuss other ways too, but I think that has already been covered by other Stack Overflower users.
To exclude some columns you can drop them in the column index. For example:
A B C D
0 1 10 100 1000
1 2 20 200 2000
Select all except two:
df[df.columns.drop(['B', 'D'])]
Output:
A C
0 1 100
1 2 200
You can also use the method truncate to select middle columns:
df.truncate(before='B', after='C', axis=1)
Output:
B C
0 10 100
1 20 200
To select multiple columns, extract and view them thereafter: df is the previously named data frame. Then create a new data frame df1, and select the columns A to D which you want to extract and view.
df1 = pd.DataFrame(data_frame, columns=['Column A', 'Column B', 'Column C', 'Column D'])
df1
All required columns will show up!
def get_slice(dataframe, start_row, end_row, start_col, end_col):
    assert len(dataframe) > end_row and start_row >= 0
    assert len(dataframe.columns) > end_col and start_col >= 0
    list_of_indexes = list(dataframe.columns)[start_col:end_col]
    ans = dataframe.iloc[start_row:end_row][list_of_indexes]
    return ans
Just use this function.
I think this is the easiest way to reach your goal.
import pandas as pd
cols = ['a', 'b']
df1 = pd.DataFrame(df, columns=cols)
df1 = df.iloc[:, 0:2]