How to sort a dataframe by a multiindex level? [duplicate] - pandas

I have a pandas dataframe with a multiindex with various data in it. Minimal example could be this one:
import numpy as np
import pandas as pd

elev = [1, 100, 10, 1000]
number = [4, 3, 1, 2]
name = ['foo', 'bar', 'baz', 'qux']
idx = pd.MultiIndex.from_arrays([name, elev, number],
                                names=('name', 'elev', 'number'))
data = np.random.rand(4, 4)
df = pd.DataFrame(data=data, columns=idx)
Now I want to sort it by its elevation or number. There seems to be a built-in function for this, MultiIndex.sortlevel, but it only sorts the MultiIndex itself, and I can't figure out how to make it sort the dataframe along with the index.
df.columns.sortlevel(level=1) gives me a sorted Multiindex
(MultiIndex([('foo', 1, 4),
             ('baz', 10, 1),
             ('bar', 100, 3),
             ('qux', 1000, 2)],
            names=['name', 'elev', 'number']),
 array([0, 2, 1, 3], dtype=int64))
but trying to apply it fails. df.columns = df.columns.sortlevel(level=1) gives me ValueError: Length mismatch: Expected axis has 4 elements, new values have 2 elements, and df = df.columns.sortlevel(level=1) just turns df into the sorted MultiIndex. The axis and inplace keywords I'm used to from similar operations aren't supported by sortlevel.
How do I apply my sorting to my dataframe?

Use DataFrame.sort_index:
df = df.sort_index(level=1, axis=1)
print(df)
name         foo       baz       bar       qux
elev           1        10       100      1000
number         4         1         3         2
0       0.009359  0.113384  0.499058  0.049974
1       0.685408  0.897657  0.486988  0.647452
2       0.896963  0.831353  0.721135  0.827568
3       0.833580  0.368044  0.957044  0.494838
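The question also asks about sorting by number. A minimal sketch of that case follows, rebuilding the example frame from the question. It passes the level name instead of its position, and also shows that the integer permutation returned by sortlevel can reorder the columns through iloc. The variable names by_number, by_elev and order are only illustrative.

```python
import numpy as np
import pandas as pd

# Rebuild the example frame from the question
idx = pd.MultiIndex.from_arrays(
    [['foo', 'bar', 'baz', 'qux'], [1, 100, 10, 1000], [4, 3, 1, 2]],
    names=('name', 'elev', 'number'))
df = pd.DataFrame(np.random.rand(4, 4), columns=idx)

# Sort the columns by the 'number' level, addressed by name
by_number = df.sort_index(axis=1, level='number')
names_by_number = by_number.columns.get_level_values('name').tolist()

# sortlevel returns (sorted index, integer positions); the positions
# can be used to reorder the columns of the frame itself
_, order = df.columns.sortlevel(level=1)
by_elev = df.iloc[:, order]
names_by_elev = by_elev.columns.get_level_values('name').tolist()
```

Both routes return a new dataframe and leave df unchanged; sort_index is the shorter of the two when only the sorted frame is needed.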


pandas read dataframe multi-header values

I have this dataframe with multiple headers
name, 00590BL, 01090BL, 01100MS, 02200MS
lat, 613297, 626278, 626323, 616720
long, 5185127, 5188418, 5188431, 5181393
elv, 1833, 1915, 1915, 1499
1956-01-01, 1, 2, 2, -2
1956-01-02, 2, 3, 3, -1
1956-01-03, 3, 4, 4, 0
1956-01-04, 4, 5, 5, 1
1956-01-05, 5, 6, 6, 2
I read this as
dfr = pd.read_csv(f_name,
                  skiprows=0,
                  header=[0, 1, 2, 3],
                  index_col=0,
                  parse_dates=True
                  )
I would like to extract the values in the header rows named 'lat' and 'long'.
An easy way would be to read the file in two steps, ending up with two dataframes. I do not like this because it is not very elegant and does not seem to take advantage of what pandas offers. I believe I could use some multi-index feature instead.
What do you think?
You can use get_level_values:
dfr = pd.read_csv(f_name, skiprows=0, header=[0, 1, 2, 3], index_col=0,
                  parse_dates=[0], skipinitialspace=True)
lat = dfr.columns.get_level_values('lat').astype(int)
long = dfr.columns.get_level_values('long').astype(int)
elv = dfr.columns.get_level_values('elv').astype(int)
Output:
>>> lat.to_list()
[613297, 626278, 626323, 616720]
>>> long.to_list()
[5185127, 5188418, 5188431, 5181393]
>>> elv.to_list()
[1833, 1915, 1915, 1499]
If you only need the first row of the column header, use droplevel:
df = dfr.droplevel(['lat', 'long', 'elv'], axis=1).rename_axis(columns=None)
print(df)
# Output
            00590BL  01090BL  01100MS  02200MS
1956-01-01        1        2        2       -2
1956-01-02        2        3        3       -1
1956-01-03        3        4        4        0
1956-01-04        4        5        5        1
1956-01-05        5        6        6        2
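To try the approach above without a file on disk, here is a small self-contained sketch. It assumes the CSV text from the question (shortened to two data rows) and reads it from an in-memory buffer. The names csv_text and station_info are illustrative, not part of the original answer.

```python
import io

import pandas as pd

# Shortened copy of the CSV from the question
csv_text = """name, 00590BL, 01090BL, 01100MS, 02200MS
lat, 613297, 626278, 626323, 616720
long, 5185127, 5188418, 5188431, 5181393
elv, 1833, 1915, 1915, 1499
1956-01-01, 1, 2, 2, -2
1956-01-02, 2, 3, 3, -1
"""

dfr = pd.read_csv(io.StringIO(csv_text), header=[0, 1, 2, 3], index_col=0,
                  parse_dates=True, skipinitialspace=True)

# Turn the column MultiIndex into a regular table, one row per station;
# header values are read as strings, so cast the numeric levels
station_info = dfr.columns.to_frame(index=False).astype(
    {'lat': int, 'long': int, 'elv': int})

# Keep only the station name as column label for the measurements
values = dfr.droplevel(['lat', 'long', 'elv'], axis=1).rename_axis(columns=None)
```

This keeps a single read_csv call while still giving two views of the data: station_info for the metadata rows and values for the time series.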
Another suggestion is to select by label with .loc, for example to extract the 'lat' values:
lat_values = dfr.loc['lat']
And similarly for the 'long' values:
long_values = dfr.loc['long']
Alternatively, the .xs method extracts the values of a given level:
lat_values = dfr.xs('lat', level=1, axis=0)
long_values = dfr.xs('long', level=1, axis=0)
Both approaches only work if 'lat' and 'long' are row labels. With the read_csv call above (header=[0, 1, 2, 3]) they become levels of the column MultiIndex instead, so neither lookup succeeds on dfr as read; get_level_values is the option that applies in that case.

Pandas - find rows sharing two out of three values, order-independent, and collect the value pairs

Given a dataframe, I am looking for rows where two out of three values are in common, regardless of the columns, hence order, in which they appear. I would like to then collect those common pairs.
Please note:
a pair of values can appear in at most two rows
a value can appear only once in a row
I would like to know what the most efficient/elegant way is in numpy or pandas to solve this problem.
For example, taking as input the dataframe
d = {'col1': [1, 2, 5, 9], 'col2': [2, 7, 1, 2], 'col3': [3, 3, 2, 7]}
df = pd.DataFrame(data=d)
   col1  col2  col3
0     1     2     3
1     2     7     3
2     5     1     2
3     9     2     7
I expect as result an array, list, something as
1 2
2 3
2 7
as the values (1, 2), (2, 3) and (2, 7) are present in two rows (first and third, first and second, and second and fourth respectively).
I cannot find a concise solution.
At the moment I have sketched a numpy solution such as
def func(x):
    rows, columns = x.shape[0], x.shape[1]
    res = []
    # Compare every pair of rows once
    for i in range(0, rows):
        for j in range(i + 1, rows):
            aux = np.intersect1d(x[i, :], x[j, :])
            # Keep the intersection if the rows share at least two values
            if aux.size > 1:
                res.append(aux)
    return res
which outputs
func(df.values)
Out: [array([2, 3]), array([1, 2]), array([2, 7])]
It looks rather cumbersome. How could I get it done with one of those cool numpy/pandas one-liners?
I would suggest using Python's built-in set operations to do most of the heavy lifting, and just applying them with pandas:
import itertools
import pandas as pd
d = {'col1': [1, 2,5,9], 'col2': [2, 7,1,2],'col3': [3, 3,2,7]}
df = pd.DataFrame(data=d)
pairs = df.apply(set, axis=1).apply(lambda x: set(itertools.combinations(x, 2))).explode()
out = set(pairs[pairs.duplicated()])
Output:
{(2, 3), (1, 2), (2, 7)}
Optionally to get it in list[np.ndarray] format:
import numpy as np
out = list(map(np.array, out))
Similar approach to that of @Chrysophylaxs but in pure Python:
from itertools import combinations
from collections import Counter
c = Counter(s for x in df.to_numpy().tolist() for s in set(combinations(set(x), r=2)))
out = [k for k,v in c.items() if v>1]
# [(2, 3), (1, 2), (2, 7)]
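If it is also useful to know which rows share each pair, the same idea can be extended with a dictionary keyed by pair. This is only a sketch on the example data; rows_by_pair and shared are illustrative names, and the pairs are sorted so that (1, 2) and (2, 1) count as the same pair.

```python
from collections import defaultdict
from itertools import combinations

import pandas as pd

df = pd.DataFrame({'col1': [1, 2, 5, 9], 'col2': [2, 7, 1, 2], 'col3': [3, 3, 2, 7]})

# Map each unordered pair of values to the labels of the rows containing it
rows_by_pair = defaultdict(list)
for row_label, row_values in zip(df.index, df.to_numpy().tolist()):
    # sorted(set(...)) makes the pair independent of column order
    for pair in combinations(sorted(set(row_values)), 2):
        rows_by_pair[pair].append(row_label)

# Keep only the pairs that occur in more than one row
shared = {pair: rows for pair, rows in rows_by_pair.items() if len(rows) > 1}
# shared == {(1, 2): [0, 2], (2, 3): [0, 1], (2, 7): [1, 3]}
```

The work is proportional to the number of rows times the number of pairs per row, so it avoids comparing every row with every other row.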
df = df.assign(col4=df.index)

def function1(ss: pd.Series):
    ss1 = ss.value_counts().loc[lambda ss: ss >= 2]
    return ss1.index.tolist() if ss1.size >= 2 else None

df.merge(df, how='cross', suffixes=('', '_2')).query("col4!=col4_2").filter(regex=r'col[^4]', axis=1)\
    .apply(function1, axis=1).dropna().drop_duplicates()
out
1    [2, 3]
2    [1, 2]
7    [2, 7]

Pandas - Merge data frames based on conditions

I would like to merge n data frames based on certain variables (external to the data frame).
Let me clarify the problem referring to an example.
We have two dataframes detailing the height and age of certain members of a population.
On top, we are given one array per data frame, containing one value per property (so array length = number of columns with numerical value in the data frame).
Consider the following two data frames
df1 = pd.DataFrame({'Name': ['A', 'B', 'C', 'D', 'E'],
                    'Age': [3, 8, 4, 2, 5], 'Height': [7, 2, 1, 4, 9]})
df2 = pd.DataFrame({'Name': ['A', 'B', 'D'],
                    'Age': [4, 6, 4], 'Height': [3, 9, 2]})
looking as
(  Name  Age  Height
 0    A    3       7
 1    B    8       2
 2    C    4       1
 3    D    2       4
 4    E    5       9,
   Name  Age  Height
 0    A    4       3
 1    B    6       9
 2    D    4       2)
As mentioned, we also have two arrays, say
array1 = np.array([ 1, 5])
array2 = np.array([2, 3])
To make the example concrete, let us say each array contains the year in which the property was measured.
The output should be constructed as follows:
if an individual appears in only one data frame, its properties are taken from that data frame
if an individual appears in more than one data frame, for each property take the value from the data frame whose associated array has the higher corresponding value. So, for property i, compare array1[[i]] and array2[[i]], and take the property values from df1 if array1[[i]] > array2[[i]], and vice versa.
In the context of the example, the rules translate to: take the property that has been measured most recently, if more than one measurement is available.
The output given the example data frames should look like
Name Age Height
0 A 4 7
1 B 6 2
2 C 4 1
3 D 4 4
4 E 5 9
Indeed, for the first property "Age", as array1[[0]] < array2[[0]], values are taken from the second dataframe, for the available individuals (A, B, D). Remaining values come from the first dataframe.
For the second property "Height", as array1[[1]] > array2[[1]], values come from the first dataframe, which already describes all the individuals.
At the moment I have some sort of solution based on looping over properties, but it is needlessly convoluted. I am wondering if any pandas expert out there could help me towards an elegant solution.
Thanks for your support.
Your question is a bit confusing: array indexes start from 0 so I think in your example it should be [[0]] and [[1]] instead of [[1]] and [[2]].
You can first concatenate your dataframes to have all names listed, then loop over your columns and update the values where the corresponding array is greater (I added a Z row to df2 to show new rows are being added):
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'Name': ['A', 'B', 'C', 'D', 'E'],
                    'Age': [3, 8, 4, 2, 5], 'Height': [7, 2, 1, 4, 9]})
df2 = pd.DataFrame({'Name': ['A', 'B', 'D', 'Z'],
                    'Age': [4, 6, 4, 8], 'Height': [3, 9, 2, 7]})
array1 = np.array([1, 5])
array2 = np.array([2, 3])
df1.set_index('Name', inplace=True)
df2.set_index('Name', inplace=True)
# Start from df1 and append the names that only appear in df2
df3 = pd.concat([df1, df2[~df2.index.isin(df1.index)]])
for i, col in enumerate(df1.columns):
    # Take df2's values only where its measurement is more recent
    if array2[i] > array1[i]:
        # Assign through .loc; df3[col].update(...) may act on a copy
        df3.loc[df2.index, col] = df2[col]
print(df3)
Note: You have to set Name as index in order to update the right rows
Output:
      Age  Height
Name
A       4       7
B       6       2
C       4       1
D       4       4
E       5       9
Z       8       7
If you have more than two dataframes in a list, you'll have to store your arrays in a list as well and iterate over the dataframe list while keeping track of the highest array values in a new array.
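One possible way to generalise this to any number of dataframes is sketched below. It is not the loop described above but a variant built on Series.combine_first: for each column the frames are visited from most to least recent, and older frames only fill in names that are still missing. The helper name merge_most_recent and its arguments are hypothetical, and ties between years are resolved arbitrarily.

```python
import numpy as np
import pandas as pd


def merge_most_recent(frames, years):
    """Hypothetical helper. frames: dataframes indexed by name with the same
    columns; years[k][i]: measurement year of column i in frames[k]."""
    merged = {}
    for i, col in enumerate(frames[0].columns):
        # Positions of the frames, most recent measurement first
        order = np.argsort([y[i] for y in years])[::-1]
        series = frames[order[0]][col]
        for k in order[1:]:
            # Keep values already present, add names seen only in older frames
            series = series.combine_first(frames[k][col])
        merged[col] = series
    return pd.DataFrame(merged)


df1 = pd.DataFrame({'Name': ['A', 'B', 'C', 'D', 'E'],
                    'Age': [3, 8, 4, 2, 5],
                    'Height': [7, 2, 1, 4, 9]}).set_index('Name')
df2 = pd.DataFrame({'Name': ['A', 'B', 'D', 'Z'],
                    'Age': [4, 6, 4, 8],
                    'Height': [3, 9, 2, 7]}).set_index('Name')

result = merge_most_recent([df1, df2], [np.array([1, 5]), np.array([2, 3])])
```

Depending on the pandas version, combine_first may return float columns when the indexes differ, so an explicit astype(int) may be wanted afterwards if integer output matters.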

How to select rows of a dataframe according to the list of ids? [duplicate]

I have the following dataframe and data list, respectively:
import pandas as pd
df = pd.DataFrame({'ID': [1, 2, 4, 7, 30],
                   'Instrument': ['temp_sensor', 'temp_sensor', 'temp_sensor',
                                  'strain_gauge', 'light_sensor'],
                   'Value': [1000, 0, 1000, 0, 1000]})
print(df)
ID    Instrument  Value
 1   temp_sensor   1000
 2   temp_sensor      0
 4   temp_sensor   1000
 7  strain_gauge      0
30  light_sensor   1000
list_ID = [2, 30]
I would like to generate a new dataframe that contains only the rows of df whose ID belongs to list_ID.
I tried the following code. However, it is not working:
d = {'ID': [], 'Instrument': [], 'Value': []}
df_aux = pd.DataFrame(d)
for j in range(0, len(df)):
    for k in range(0, len(list_ID)):
        if df['ID'][j] == list_ID[k]:
            df_aux.append(df[df['ID'][j] == list_ID[k]])
The error appears: KeyError: True
I would like the output of df_aux to be:
ID    Instrument  Value
 2   temp_sensor      0
30  light_sensor   1000
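A common way to get this result, in line with the linked duplicate questions, is a boolean mask built with Series.isin. The sketch below reuses the data from the question; reset_index(drop=True) is an optional extra that renumbers the rows of the result.

```python
import pandas as pd

df = pd.DataFrame({'ID': [1, 2, 4, 7, 30],
                   'Instrument': ['temp_sensor', 'temp_sensor', 'temp_sensor',
                                  'strain_gauge', 'light_sensor'],
                   'Value': [1000, 0, 1000, 0, 1000]})
list_ID = [2, 30]

# True for the rows whose ID is one of the requested values
mask = df['ID'].isin(list_ID)

# Keep the matching rows and renumber them from 0
df_aux = df[mask].reset_index(drop=True)
```

The KeyError: True in the loop comes from df[df['ID'][j] == list_ID[k]]: the comparison yields a single boolean, which is then looked up as a column name.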

Pandas: how to retrieve values from a DataFrame given a list of (row, column) pairs?

TL;DR: I want to pass a series of positions on a DataFrame and receive a series of values, if possible with a DataFrame method.
I have a Dataframe with some columns and an index
import pandas as pd
df_a = pd.DataFrame(
    {'A': [0, 1, 3, 7],
     'B': [2, 3, 4, 5]}, index=[0, 1, 2, 3])
I want to retrieve the values at specific (row, column) positions on the DataFrame
rows = [0, 2, 3]
cols = ['A','B','A']
df_a.loc[rows, cols] returns a 3x3 DataFrame
   A  B  A
0  0  2  0
2  3  4  3
3  7  5  7
I want the series of values corresponding to the (row, col) values, a series of length 3
[0, 4, 7]
What is the best way to do this in pandas?
You can use DataFrame.lookup to achieve exactly what you want, provided your pandas version still has it (the method was deprecated in pandas 1.2 and removed in 2.0):
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.lookup.html
import pandas as pd
df_a = pd.DataFrame({'A':[0,1,3,7], 'B':[2,3,4,5]}, index=[0,1,2,3])
rows = [0, 2, 3]
cols = ['A','B','A']
values = df_a.lookup(rows, cols)
print(values)
array([0, 4, 7], dtype=int64)
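Pandas itself does not support that kind of pointwise indexing, but NumPy does. Note that rows is used as integer positions here, which matches the labels only because the index is 0..3:
>>> df_a.to_numpy()[rows, df_a.columns.get_indexer(cols)]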
array([0, 4, 7])
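For an index that is not simply 0..n-1, both the row labels and the column labels can be translated to positions first. This is a minimal sketch of that variant on the example data; row_pos, col_pos and picked are illustrative names.

```python
import pandas as pd

df_a = pd.DataFrame({'A': [0, 1, 3, 7], 'B': [2, 3, 4, 5]}, index=[0, 1, 2, 3])
rows = [0, 2, 3]
cols = ['A', 'B', 'A']

# Translate labels to integer positions; get_indexer returns -1 for a
# label that is not found, so unknown labels should be checked beforehand
row_pos = df_a.index.get_indexer(rows)
col_pos = df_a.columns.get_indexer(cols)

# NumPy pairs the two position arrays element by element
values = df_a.to_numpy()[row_pos, col_pos]

# Wrap the result in a Series labelled by the requested rows
picked = pd.Series(values, index=rows)
```

On a frame with mixed column dtypes, to_numpy() produces an object array, so the selected values may need converting afterwards.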