Skip first column - pandas

Quite simple question, I hope. Basically I want the same output without the first column.
import pandas as pd

ipl_data = {'Team': ['Riders', 'Riders', 'Devils', 'Devils', 'Kings',
                     'kings', 'Kings', 'Kings', 'Riders', 'Royals', 'Royals', 'Riders'],
            'Rank': [1, 2, 2, 3, 3, 4, 1, 1, 2, 4, 1, 2],
            'Year': [2014, 2015, 2014, 2015, 2014, 2015, 2016, 2017, 2016, 2014, 2015, 2017],
            'Points': [876, 789, 863, 673, 741, 812, 756, 788, 694, 701, 804, 690]}
df = pd.DataFrame(ipl_data)
df.loc[df['Team'] == 'Riders'].values.tolist()
Out[1]:
[['Riders', 1, 2014, 876],
 ['Riders', 2, 2015, 789],
 ['Riders', 2, 2016, 694],
 ['Riders', 2, 2017, 690]]
I want my output to be:
Out[1]:
[[1, 2014, 876],
 [2, 2015, 789],
 [2, 2016, 694],
 [2, 2017, 690]]

You can do this:
df.loc[df['Team'] == 'Riders', ['Rank', 'Year', 'Points']].values.tolist()
Or, if you want to select the columns without explicitly specifying column names:
columns = df.columns.values.tolist()[1:]
df.loc[df['Team']=='Riders', columns].values.tolist()

Use:
df.loc[df.Team == "Riders", df.columns[1:]].to_numpy().tolist()
to_numpy() is recommended instead of values according to the pandas documentation. Both give you NumPy arrays, so you can still call tolist() on the result.

You can select all columns except the first by position with DataFrame.iloc:
# pandas 0.24+
print(df.iloc[(df['Team'] == 'Riders').to_numpy(), 1:].to_numpy().tolist())
# older pandas versions
# print(df.iloc[(df['Team'] == 'Riders').values, 1:].values.tolist())
[[1, 2014, 876], [2, 2015, 789], [2, 2016, 694], [2, 2017, 690]]

Related

Finding values from different rows in pandas

I have a dataframe containing the data, and another dataframe containing a single row that carries indices.
import pandas as pd

data = {'col_1': [4, 5, 6, 7], 'col_2': [3, 4, 9, 8], 'col_3': [5, 5, 6, 9], 'col_4': [8, 7, 6, 5]}
df = pd.DataFrame(data)
ind = {'ind_1': [2], 'ind_2': [1], 'ind_3': [3], 'ind_4': [2]}
ind = pd.DataFrame(ind)
Both have the same number of columns. I want to extract the values of df corresponding to the index stored in ind so that I get a single row at the end.
For this data it should be: [6, 4, 9, 6]. I tried df.loc[ind.loc[0]] but that of course gives me four different rows, not one.
The other idea I have is to zip columns and rows and iterate over them. But I feel there should be a simpler way.
You can go to the NumPy domain and index there:
In [14]: df.to_numpy()[ind, np.arange(len(df.columns))]
Out[14]: array([[6, 4, 9, 6]], dtype=int64)
This pairs up 2, 1, 3, 2 from ind with 0, 1, 2, 3 (from 0 to the number of columns minus 1), so we get the values at [2, 0], [1, 1], and so on.
There's also df.lookup, but it's being deprecated, so...
In [19]: df.lookup(ind.iloc[0], df.columns)
~\Anaconda3\Scripts\ipython:1: FutureWarning: The 'lookup' method is deprecated and will be removed in a future version. You can use DataFrame.melt and DataFrame.loc as a substitute.
Out[19]: array([6, 4, 9, 6], dtype=int64)

How to convert list of pandas._libs.tslibs.timestamps.Timestamp to datetime.datetime?

I have a list of pandas._libs.tslibs.timestamps.Timestamp objects.
I need to convert this to a pandas.core.series.Series of datetime.datetime.
to_pydatetime() works for a single row, but not for the whole column.
df = pd.DataFrame({"year": [2015, 2016],
                   "month": [2, 3],
                   "day": [4, 5],
                   "hour": [2, 3]})
df = pd.to_datetime(df)
type(df.loc[0])
Out: pandas._libs.tslibs.timestamps.Timestamp
I want to change the DataFrame to a pandas.core.series.Series of datetime.datetime.
My desired output is as below:
df
Out:
0 2015-02-04 02:00:00
1 2016-03-05 03:00:00
df.loc[0]
out: datetime.datetime(2015, 2, 4, 2, 0)
df.loc[1]
out: datetime.datetime(2016, 3, 5, 3, 0)
My question above may look strange, as I start from a DataFrame, change it to pandas._libs.tslibs.timestamps.Timestamp, and then try to go back to a DataFrame. However, I used this code just as an example. What I actually have is a DataFrame imported from an Excel file; I did a computation using pandas.tseries and ended up with pandas._libs.tslibs.timestamps.Timestamp values. I need to convert these to datetime.datetime. I couldn't post all my Excel files and the rest of the code, so I took the above code as an example.

NumPy: generalize one-hot encoding to k-hot encoding

I'm using this code to one-hot encode values:
import numpy as np

idxs = np.array([1, 3, 2])
vals = np.zeros((idxs.size, idxs.max() + 1))
vals[np.arange(idxs.size), idxs] = 1
But I would like to generalize it to k-hot encoding (where the shape of vals would be the same, but each row can contain k ones).
Unfortunately, I can't figure out how to index multiple columns from each row. I tried vals[0:2, [[0, 1], [3]]] to select the first and second columns from the first row and the third column from the second row, but it does not work.
It's called advanced indexing.
to select first and second column from first row and third column from second row
You just need to pass the respective rows and columns in separate iterables (tuple, list):
In [9]: a
Out[9]:
array([[0, 1, 2, 3, 4],
       [5, 6, 7, 8, 9]])
In [10]: a[[0, 0, 1], [0, 1, 3]]
Out[10]: array([0, 1, 8])

Numpy Indexing Behavior

I am having a lot of trouble understanding NumPy indexing for multidimensional arrays. In this example, let's say that I have a 2D array, A, which is 100x10. Then I have another array, B, a length-100 1D array of values between 0-9 (column indices for A). In MATLAB, I would use A(sub2ind(size(A), (1:size(A,1))', B)) to return, for each row of A, the value at the index stored in the corresponding row of B.
So, as a test case, let's say I have this:
A = np.random.rand(100,10)
B = np.int32(np.floor(np.random.rand(100)*10))
If I print their shapes, I get:
print A.shape returns (100L, 10L)
print B.shape returns (100L,)
When I try to index into A using B naively (incorrectly):
Test1 = A[:,B]
print Test1.shape returns (100L, 100L)
but if I do
Test2 = A[range(A.shape[0]),B]
print Test2.shape returns (100L,)
which is what I want. I'm having trouble understanding the distinction being made here. In my mind, A[:,5] and A[range(A.shape[0]),5] should return the same thing, but that isn't the case here. How is : different from using range(A.shape[0]), which just creates the array of indices from 0 to A.shape[0] - 1, to use as indices?
Let's look at a simple array:
In [654]: X = np.arange(12).reshape(3, 4)
In [655]: X
Out[655]:
array([[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11]])
With the slice we can pick 3 columns of X, in any order (and even repeated). In other words, take all the rows, but selected columns.
In [656]: X[:, [3, 2, 1]]
Out[656]:
array([[ 3,  2,  1],
       [ 7,  6,  5],
       [11, 10,  9]])
If instead I use a list (or array) of 3 values, it pairs them up with the column values, effectively picking 3 values, X[0,3],X[1,2],X[2,1]:
In [657]: X[[0, 1, 2], [3, 2, 1]]
Out[657]: array([3, 6, 9])
If instead I gave it a column vector to index rows, I get the same thing as with the slice:
In [659]: X[[[0], [1], [2]], [3, 2, 1]]
Out[659]:
array([[ 3,  2,  1],
       [ 7,  6,  5],
       [11, 10,  9]])
This amounts to picking 9 individual values, as generated by broadcasting:
In [663]: np.broadcast_arrays(np.arange(3)[:, None], np.array([3, 2, 1]))
Out[663]:
[array([[0, 0, 0],
        [1, 1, 1],
        [2, 2, 2]]),
 array([[3, 2, 1],
        [3, 2, 1],
        [3, 2, 1]])]
NumPy indexing can be confusing, but a good starting point is this page: http://docs.scipy.org/doc/numpy/reference/arrays.indexing.html

Compute unique groups from Pandas group-by results

I'd like to count the unique groups from the result of a Pandas group-by operation. For instance, here is an example data frame.
In [98]: df = pd.DataFrame({'A': [1,2,3,1,2,3], 'B': [10,10,11,10,10,15]})
In [99]: df.groupby('A').groups
Out[99]: {1: [0, 3], 2: [1, 4], 3: [2, 5]}
The conceptual groups are {1: [10, 10], 2: [10, 10], 3: [11, 15]}, where the index locations in the groups above are substituted with the values from column B. The first problem I've run into is how to convert those positions (e.g. [0, 3]) into values from the B column.
Given the ability to convert the groups into the value groups from column B, I can compute the unique groups by hand; but a secondary question here is whether Pandas has a built-in routine for this, which I haven't seen.
Edit: updated with the target output.
This is the output I would be looking for in the simplest case:
{1: [10, 10], 2: [10, 10], 3: [11, 15]}
And counting the unique groups would produce something equivalent to:
{[10, 10]: 2, [11, 15]: 1}
How about:
>>> df = pd.DataFrame({'A': [1,2,3,1,2,3], 'B': [10,10,11,10,10,15]})
>>> df.groupby("A")["B"].apply(tuple).value_counts()
(10, 10)    2
(11, 15)    1
dtype: int64
or maybe
>>> df.groupby("A")["B"].apply(lambda x: tuple(sorted(x))).value_counts()
(10, 10)    2
(11, 15)    1
dtype: int64
if you don't care about the order within the group.
You can trivially call .to_dict() if you'd like, e.g.
>>> df.groupby("A")["B"].apply(tuple).value_counts().to_dict()
{(11, 15): 1, (10, 10): 2}
Maybe:
>>> df.groupby('A')['B'].aggregate(lambda ts: list(ts.values)).to_dict()
{1: [10, 10], 2: [10, 10], 3: [11, 15]}
For counting the groups you need to convert to tuples, because lists are not hashable:
>>> ts = df.groupby('A')['B'].aggregate(lambda ts: tuple(ts.values))
>>> ts.value_counts().to_dict()
{(11, 15): 1, (10, 10): 2}