How to Select Rows and Columns of Dataframe? - pandas

I have a dataframe with 4 columns and 6 rows. I want to select the 2nd and 4th columns and the 1st and 6th rows of this dataframe and create a new dataframe. How can I do this?

You can use the code below. Keep in mind that positional indexing in pandas starts from 0, so position 0 refers to the first row or the first column.
>>> import pandas as pd
>>>
>>> dictA = {'col1': ['tom', 10,20,56,2,3,4],'col2': ['tom', 10,20,56,2,3,4],'col3': ['tom', 10,20,56,2,3,4],'col4': ['tom', 10,20,56,2,3,4]}
>>>
>>> dfA = pd.DataFrame(dictA)
>>> dfA
col1 col2 col3 col4
0 tom tom tom tom
1 10 10 10 10
2 20 20 20 20
3 56 56 56 56
4 2 2 2 2
5 3 3 3 3
6 4 4 4 4
>>> new_df = dfA.iloc[[0,5],[1,3]]
>>> new_df
col2 col4
0 tom tom
5 3 3
For more details have a look here
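As a minimal sketch of the same idea on a frame that matches the question (6 rows, 4 columns), the snippet below selects the 1st and 6th rows and the 2nd and 4th columns by position with iloc, and then by label with loc for comparison. The column names and values are made up for illustration.

```python
import pandas as pd

# Hypothetical 6x4 frame; names and values are placeholders.
df = pd.DataFrame({
    "col1": range(0, 6),
    "col2": range(10, 16),
    "col3": range(20, 26),
    "col4": range(30, 36),
})

# Positions are 0-based: rows 0 and 5 are the 1st and 6th rows,
# columns 1 and 3 are the 2nd and 4th columns.
by_position = df.iloc[[0, 5], [1, 3]]

# The same selection by label (the default row labels are 0..5 here).
by_label = df.loc[[0, 5], ["col2", "col4"]]

print(by_position)
```

With the default integer index the two selections coincide; they would differ if the row labels were not 0..5.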

Related

Convert multiple columns in pandas dataframe to array of arrays

I have the following dataframe:
col1 col2 col3
1 1 2 3
2 4 5 6
3 7 8 9
4 10 11 12
I want to create a new column that will be an array of arrays: each cell contains a single array built from specific columns, cast to float.
So given column names, say "col2" and "col3", the output dataframe would look like this.
col1 col2 col3 new
1 1 2 3 [[2,3]]
2 4 5 6 [[5,6]]
3 7 8 9 [[8,9]]
4 10 11 12 [[11,12]]
What I have so far works, but seems clumsy and I believe there's a better way. I'm fairly new to pandas and numpy.
selected_columns = ["col2", "col3"]
df[selected_columns] = df[selected_columns].astype(float)
df['new'] = df.apply(lambda r: tuple(r[selected_columns]), axis=1)
.apply(np.array)
.apply(lambda r: tuple(r[["new"]]), axis=1)
.apply(np.array)
Appreciate your help, Thanks!
Using agg:
cols = ['col2', 'col3']
df['new'] = df[cols].agg(list, axis=1)
Using numpy:
df['new'] = df[cols].to_numpy().tolist()
Output:
col1 col2 col3 new
1 1 2 3 [2, 3]
2 4 5 6 [5, 6]
3 7 8 9 [8, 9]
4 10 11 12 [11, 12]
2D lists
cols = ['col2', 'col3']
df['new'] = df[cols].agg(lambda x: [list(x)], axis=1)
# or
df['new'] = df[cols].to_numpy()[:,None].tolist()
Output:
col1 col2 col3 new
1 1 2 3 [[2, 3]]
2 4 5 6 [[5, 6]]
3 7 8 9 [[8, 9]]
4 10 11 12 [[11, 12]]
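The question also asks for the values to be cast to float, which the outputs above do not show. A minimal sketch, assuming the frame from the question, that adds astype(float) before the NumPy conversion so each cell holds a nested list of floats:

```python
import pandas as pd

# Frame reconstructed from the question.
df = pd.DataFrame(
    {"col1": [1, 4, 7, 10], "col2": [2, 5, 8, 11], "col3": [3, 6, 9, 12]},
    index=[1, 2, 3, 4],
)

cols = ["col2", "col3"]

# Cast to float first, then add a middle axis so each row becomes [[a, b]].
df["new"] = df[cols].astype(float).to_numpy()[:, None].tolist()

print(df)
```

Note that this stores nested Python lists rather than NumPy arrays in the column.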

pandas rolling calc count function

import pandas as pd
import numpy as np
df = pd.DataFrame({'col1': np.arange(6), 'col2': np.arange(2, 8)})
col1 col2
0 3
1 4
2 5
3 6
4 7
I want to add a column col3 that is computed over a rolling window of 3 rows (like tail(3)) and returns the count of rows in the window where col1>=3 and col2>=3.
The result I want looks like this:
col1 col2 col3 reason
0 3 0
1 4 1 [(3>=3 and 6>=3)]
2 5 2 [(3>=3 and 6>=3),(4>=3 and 7>=3)]
3 6 nan
4 7 nan
Hope to get your reply as soon as possible
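One possible reading of the expected table is a forward-looking window made of the current row and the next two rows, with nan where fewer than three rows remain. A minimal sketch under that assumption, using the values shown in the table (which differ from what the np.arange call above produces):

```python
import pandas as pd

# Values taken from the table in the question.
df = pd.DataFrame({"col1": [0, 1, 2, 3, 4], "col2": [3, 4, 5, 6, 7]})

# 1 where both conditions hold, 0 otherwise.
hit = ((df["col1"] >= 3) & (df["col2"] >= 3)).astype(int)

# rolling() looks backwards, so reverse the series, roll over 3 rows,
# and reverse back to get a window of the current row plus the next two.
df["col3"] = hit[::-1].rolling(3).sum()[::-1]

print(df)
```

The last two rows stay nan because their windows are incomplete, which matches the table in the question.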

Add values in columns of multiple dataframes if values in another column are same

Question related to pandas dataframe
df1:
id count
1 3
2 7
3 11
df2:
id count
3 6
4 8
5 2
df3:
id count
2 1
4 3
6 9
Expected output df:
id count
1 3
2 8
3 17
4 11
5 2
6 9
Any help is appreciated. Thanks in advance!
Use concat and aggregate sum:
df = pd.concat([df1, df2, df3]).groupby('id', as_index=False).sum()
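For reference, a self-contained sketch of the line above, with the three frames rebuilt from the question:

```python
import pandas as pd

# Frames reconstructed from the question.
df1 = pd.DataFrame({"id": [1, 2, 3], "count": [3, 7, 11]})
df2 = pd.DataFrame({"id": [3, 4, 5], "count": [6, 8, 2]})
df3 = pd.DataFrame({"id": [2, 4, 6], "count": [1, 3, 9]})

# Stack the frames vertically, then sum the counts that share an id.
df = pd.concat([df1, df2, df3]).groupby("id", as_index=False).sum()

print(df)
```

Passing as_index=False keeps id as a regular column instead of moving it into the index.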

Combine multiple columns into two columns: "column name" and "value"

There is probably an easy way of doing this, so I hope someone has a nice solution (currently I am doing it with ugly for loops).
My data looks like:
In [1]: df = pd.DataFrame({'Ref': [5, 6, 7],
'Col1': [10,11,12],
'Col2': [20,21,22],
'Col3': [30,31,32]})
In [2]: df
Out[2]:
Col1 Col2 Col3 Ref
0 10 20 30 5
1 11 21 31 6
2 12 22 32 7
And I am trying to flatten the table (for 2D histograms) to use a single column for the column id and one column for the actual values while keeping the corresponding Ref, like this:
Ref Col Value
0 5 1 10
1 5 2 20
2 5 3 30
3 6 1 11
4 6 2 21
5 6 3 31
6 7 1 12
7 7 2 22
8 7 3 32
I remember there was some kind of a join/group operation to do the reverse operation, but I cannot recall it anymore...
Maybe not the most elegant solution, but it works on your data. Using a combination of pivot_table and stack.
import pandas as pd
df = pd.DataFrame({'Ref': [5, 6, 7],
'Col1': [10,11,12],
'Col2': [20,21,22],
'Col3': [30,31,32]})
# In [23]: df
# Out[23]:
# Col1 Col2 Col3 Ref
# 0 10 20 30 5
# 1 11 21 31 6
# 2 12 22 32 7
piv = df.pivot_table(index=['Ref']).stack()
df2 = pd.DataFrame(piv)
df2.reset_index(inplace=True)
df2.columns = ['Ref','Col','Value']
# In [19]: df2
# Out[19]:
# Ref Col Value
# 0 5 Col1 10
# 1 5 Col2 20
# 2 5 Col3 30
# 3 6 Col1 11
# 4 6 Col2 21
# 5 6 Col3 31
# 6 7 Col1 12
# 7 7 Col2 22
# 8 7 Col3 32
If you want 'Col' to be just the last digit of the column name, you could do something like this:
df2.Col = df2.Col.apply(lambda x: x[-1:])
# In [21]: df2
# Out[21]:
# Ref Col Value
# 0 5 1 10
# 1 5 2 20
# 2 5 3 30
# 3 6 1 11
# 4 6 2 21
# 5 6 3 31
# 6 7 1 12
# 7 7 2 22
# 8 7 3 32
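As an alternative to the pivot_table and stack approach above (a different technique, not the one this answer uses), pandas also provides melt for reshaping from wide to long. A minimal sketch on the same data:

```python
import pandas as pd

df = pd.DataFrame({'Ref': [5, 6, 7],
                   'Col1': [10, 11, 12],
                   'Col2': [20, 21, 22],
                   'Col3': [30, 31, 32]})

# Keep 'Ref' as the identifier; every other column becomes a (Col, Value) pair.
long_df = (
    df.melt(id_vars='Ref', var_name='Col', value_name='Value')
      .sort_values(['Ref', 'Col'])   # melt groups by column, so reorder by Ref
      .reset_index(drop=True)
)

print(long_df)
```

Unlike pivot_table, melt does not aggregate, so the values keep their original integer dtype.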

In pandas, how to set_index with using column index instead of referring to column names?

For example:
We have a pandas DataFrame foo with 2 columns ['A', 'B'].
I want to do something like
foo.set_index([0,1])
instead of
foo.set_index(['A', 'B'])
I have tried foo.set_index([[0,1]]) as well, but it raised this error:
Length mismatch: Expected axis has 9 elements, new values have 2 elements
If the column index is unique you could use:
df.set_index(list(df.columns[cols]))
where cols is a list of ordinal indices.
For example,
In [77]: np.random.seed(2016)
In [79]: df = pd.DataFrame(np.random.randint(10, size=(5,4)), columns=list('ABCD'))
In [80]: df
Out[80]:
A B C D
0 3 7 2 3
1 8 4 8 7
2 9 2 6 3
3 4 1 9 1
4 2 2 8 9
In [81]: df.set_index(list(df.columns[[0,2]]))
Out[81]:
B D
A C
3 2 7 3
8 8 4 7
9 6 2 3
4 9 1 1
2 8 2 9
If the DataFrame's column index is not unique, then setting the index by label
is impossible and by ordinals more complicated:
import numpy as np
import pandas as pd
np.random.seed(2016)
def set_ordinal_index(df, cols):
    columns, df.columns = df.columns, np.arange(len(df.columns))
    mask = df.columns.isin(cols)
    df = df.set_index(cols)
    df.columns = columns[~mask]
    df.index.names = columns[mask]
    return df
df = pd.DataFrame(np.random.randint(10, size=(5,4)), columns=list('AAAA'))
print(set_ordinal_index(df, [0,2]))
yields
A A
A A
3 2 7 3
8 8 4 7
9 6 2 3
4 9 1 1
2 8 2 9
This worked for me, the other answer didn't.
# single column
df.set_index(df.columns[1])
# multi column
df.set_index(df.columns[[1, 0]].tolist())