Looking up values from one dataframe in specific row of another dataframe - pandas

I'm struggling with a bit of a complex (to me) lookup-type problem.
I have a dataframe df1 that looks like this:
Grade Value
0 A 16
1 B 12
2 C 5
And another dataframe (df2), where the values of one of df1's columns ('Grade') form the index:
Tier 1 Tier 2 Tier 3
A 20 17 10
B 16 11 3
C 7 6 2
I've been trying to write a bit of code that, for each row in df1, looks up the row corresponding to 'Grade' in df2, finds the smallest value in that row greater than 'Value', and returns the name of that column.
E.g. for the second row of df1, it looks up the row with index 'B' in df2: 16 is the smallest value greater than 12, so it returns 'Tier 1'. Ideal output would be:
Grade Value Tier
0 A 16 Tier 2
1 B 12 Tier 1
2 C 5 Tier 2
My novice, downloaded-Python-last-week attempt so far has been as follows, which is throwing up all manner of errors and doesn't even try returning the column name. Sorry also about the micro-ness of the question: any help appreciated!
for i, row in input_df1.iterrows():
Tier = np.argmin(df1['Value']<df2.loc[row,0:df2.shape[1]])

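# Keep only the df2 values greater than Value (the rest become NaN), then take the column holding the smallest one.
df1['Tier'] = df2.loc[df1.Grade].where(lambda t: t.gt(df1.Value.values, axis=0)).idxmin(axis=1).values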
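As a runnable illustration of the lookup described in the question, here is a minimal sketch that rebuilds the two sample frames and works through the steps one at a time. The intermediate names (`tiers`, `above`) are my own, and the sketch assumes every row of df1 has at least one value in df2 greater than 'Value'.

```python
import pandas as pd

# Sample data copied from the question.
df1 = pd.DataFrame({"Grade": ["A", "B", "C"], "Value": [16, 12, 5]})
df2 = pd.DataFrame(
    {"Tier 1": [20, 16, 7], "Tier 2": [17, 11, 6], "Tier 3": [10, 3, 2]},
    index=["A", "B", "C"],
)

# One row of df2 per row of df1, relabelled so it lines up with df1's index.
tiers = df2.loc[df1["Grade"]]
tiers.index = df1.index

# Keep only the thresholds strictly greater than Value; the rest become NaN.
above = tiers.where(tiers.gt(df1["Value"], axis=0))

# idxmin skips NaN, so it returns the column of the smallest remaining value.
df1["Tier"] = above.idxmin(axis=1)
print(df1)
```

A row with no value above 'Value' would end up all NaN and needs separate handling, since the behaviour of `idxmin` on an all-NaN row differs between pandas versions.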


persistent column label in pandas dataframe

I have an issue when working with pandas indexing. It first happened on a larger dataset, and I was able to recreate it in this dummy dataframe. Apologies if my table formatting is terrible; I don't know how to make it better visually.
Unnamed: 0 col1 col2 col3
0 Name Sun Mon Tue
1 one 1 2 1
2 two 4 4 3
3 three 2 1 1
4 four 1 5 5
5 five 1 5 5
6 six 5 1 1
7 seven 5 5 6
8 eight 5 3 4
9 nine 5 3 3
So what I am trying to do is rename the first column label ('Unnamed: 0') to something meaningful, but when I finally call reset_index, the index "column" has the name "test" for some reason, while the first actual column gets the label "index".
df.rename({df.columns[0]: 'test'}, axis=1, inplace=True)
df.set_index('test', inplace=True)
dft = df.transpose()
dft
test Name one two three four five six seven eight nine
col1 Sun 1 4 2 1 1 5 5 5 5
col2 Mon 2 4 1 5 5 1 5 3 3
col3 Tue 1 3 1 5 5 1 6 4 3
First of all, if my understanding is correct, the index is not even an actual column in the dataframe, so why does it get a label when the index is reset?
And more importantly, why are the labels "test" and "index" reversed?
dft.reset_index(inplace=True)
dft
test index Name one two three four five six seven eight nine
0 col1 Sun 1 4 2 1 1 5 5 5 5
1 col2 Mon 2 4 1 5 5 1 5 3 3
2 col3 Tue 1 3 1 5 5 1 6 4 3
I have tried every combination of set_index / reset_index I can think of, including drop=True and inplace=True, but I cannot find a way to create a proper index like the one I started with.
Yes, both axes (the index and the column axis) can have names.
This is useful for multi-indexing.
When you call .reset_index, the index is extracted into a new column, which takes the name of the index (or 'index' by default, if the index has no name).
If you want, you can reset and rename index in one line:
df.rename_axis('Name').reset_index()
Why is 'test' printed not where I expect?
After your code, if you print(dft.columns), you will see:
Index(['index', 'Name', 'one', 'two', 'three', 'four', 'five', 'six', 'seven', 'eight', 'nine'],
dtype='object',
name='test')
There are 11 columns. The column axis is called 'test' (see name='test' in the output above).
Also: print(dft.columns.name) prints test.
So what you actually see when you print your dataframe are the column names, to the left of which is the name of the column axis: 'test'.
It is NOT how the index axis is named. You can check: print(type(dft.index.name)) prints <class 'NoneType'>.
Now, why is column axis named 'test'?
Let's see how it got there step by step.
df.rename({df.columns[0]: 'test'}, axis=1, inplace=True)
First column is now named 'test'.
df.set_index('test', inplace=True)
First column has moved from being a column to being an index. The index name is 'test'. The old index disappeared.
dft = df.transpose()
The column axis is now named 'test'. The index is now named however the column axis was named before transposing.
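The steps above can be reproduced with a shortened version of the question's table. This is a minimal sketch: the three-row frame and the final cleanup (clearing the column-axis name and renaming 'index' to a hypothetical label 'col') are my own illustration, not part of the original answer.

```python
import pandas as pd

# Shortened version of the dummy table from the question.
df = pd.DataFrame({
    "Unnamed: 0": ["Name", "one", "two"],
    "col1": ["Sun", 1, 4],
    "col2": ["Mon", 2, 4],
})

df = df.rename(columns={"Unnamed: 0": "test"}).set_index("test")
dft = df.transpose()

# After transposing, 'test' is the name of the column axis, not of the index.
print(dft.index.name)    # None
print(dft.columns.name)  # test

out = dft.reset_index()
# The old index became a column called 'index'; the column axis is still named 'test'.
print(list(out.columns))
print(out.columns.name)

# Optional cleanup: drop the axis name and give the new column a meaningful label.
clean = out.copy()
clean.columns.name = None
clean = clean.rename(columns={"index": "col"})
print(clean)
```

Clearing `columns.name` is what removes the stray 'test' label from the printed header.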

How to split pandas dataframe into multiple dataframes (holding together rows) based upon a column's value

My problem is similar to the "split a dataframe into chunks of N rows" problem, except that the number of rows in each chunk will be different. I have a dataframe like this:
A B C
1 2 0
1 2 1
1 2 2
1 2 0
1 2 1
1 2 2
1 2 3
1 2 4
1 2 0
Columns A and B don't matter here. Column C starts at 0 and increments with each row until it resets to 0. So in the dataframe above, the first 3 rows form one new dataframe, the next 5 rows form a second one, and this continues as more rows are added.
To finish off the question,
dfs = [x for _, x in df.groupby(df['C'].eq(0).cumsum())]
groups the rows into the sub-blocks, and each element of the resulting list is one sub-block as a separate dataframe.
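To see why this works, here is a minimal sketch using the sample data from the question; the names `group_id` and `chunks` are my own. Each 0 in column C marks the start of a new block, so the running count of zeros gives every row a block number that groupby can split on.

```python
import pandas as pd

# Sample data from the question.
df = pd.DataFrame({
    "A": [1] * 9,
    "B": [2] * 9,
    "C": [0, 1, 2, 0, 1, 2, 3, 4, 0],
})

# True wherever C resets to 0; the cumulative sum numbers the blocks 1, 2, 3, ...
group_id = df["C"].eq(0).cumsum()

# One dataframe per block, in order of appearance.
chunks = [chunk for _, chunk in df.groupby(group_id)]

for chunk in chunks:
    print(chunk, end="\n\n")
```

This assumes the very first row has C equal to 0; any rows before the first 0 would end up in a block numbered 0.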

Pandas Create Variability DF with Multiple Row Averages in Different DF

I've been trying to create a column of variability relative to the long-term mean of the data values for 'A' and 'B' below. I don't understand how to divide each row's single data value by the long-term average for its name. For example, imagine I have data that looks like this in a pandas dataframe df1:
Year Name Data
1999 A 2
2000 A 4
1999 B 6
2000 B 8
And I have a DF with the long-term means, called "LTmean", which in this case are 3 and 7.
mean_df =
Name Data mean
0 A 3
1 B 7
So, the result for a new df would look like this, with dfnew['Var'] = df1['Data'] / mean_df(???) - 1:
Year Name Var
1999 A -0.3
2000 A 0.3
1999 B -0.14
2000 B 0.14
Thank you for any suggestions on this! Would a loop be the best idea, looping through each column by the 'Name' in each DF somehow?
df['Var'] = df1['Data']/LTmean - 1
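For the one-liner above to give per-name results, the long-term mean has to be lined up row by row with df1. A minimal sketch of one way to do that follows; it assumes the mean column in mean_df is called 'mean' (the question's header is ambiguous), and `lt_mean` is a name I introduce for illustration.

```python
import pandas as pd

# Sample data from the question.
df1 = pd.DataFrame({
    "Year": [1999, 2000, 1999, 2000],
    "Name": ["A", "A", "B", "B"],
    "Data": [2, 4, 6, 8],
})
# Assumption: the long-term mean column is named 'mean'.
mean_df = pd.DataFrame({"Name": ["A", "B"], "mean": [3, 7]})

# Look up each row's long-term mean by Name, so it aligns with df1 row by row.
lt_mean = df1["Name"].map(mean_df.set_index("Name")["mean"])

dfnew = df1[["Year", "Name"]].copy()
dfnew["Var"] = df1["Data"] / lt_mean - 1
print(dfnew.round(2))
```

No explicit loop is needed: `map` does the per-name lookup and the division is element-wise.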

Lookup a pandas df for a column value by matching rows with another dataframe

Say I have a pandas dataframe df1 as follows:
OpDay Rid Tid Sid Dist
0 18Sep 1 1 1 10
1 18Sep 1 1 1 15
2 18Sep 1 1 1 20
3 18Sep 1 5 4 5
4 18Sep 1 5 4 50
and df2 like:
S_Day R_ID T_ID S_ID ABC XYZ
0 18Sep 1 1 1 100 60
1 18Sep 1 5 4 125 100
The number of rows in df2 is equal to the total number of unique combinations of OpDay+Rid+Tid+Sid in df1.
Now, I want the values of columns ABC and XYZ from df2 corresponding to each unique combination. But I don't want to store these values in df1. I just need them for a computation, and then I want to store the result in df2 only, by creating a new column.
To summarize, let's say I want to do some computation using df1.Dist[3], for which I also need values from columns df2.ABC and df2.XYZ. So first find the row index in df2 where
S_Day = OpDay[3],
R_ID = Rid[3],
T_ID = Tid[3] and
S_ID = Sid[3]
(in this case it's row 1),
so use df2.ABC[1] and df2.XYZ[1] and store results in df2.RESULT[1].
So now df2 will look something like:
S_Day R_ID T_ID S_ID ABC XYZ RESULT
0 18Sep 1 1 1 100 60 NaN
1 18Sep 1 5 4 125 100 some computed value
Basically I guess I need a lookup kind of a function but don't know how to proceed further.
Please help as I am new to the world of python and programming. Many thanks in advance.
You can use .loc and Boolean indices to do what you want. Let's say that you're after the ith row of df1:
i = 3
Next, you can use Boolean indexing to find the corresponding rows in df2:
bool_index = (
    (df1.loc[i, 'OpDay'] == df2.loc[:, 'S_Day'])
    & (df1.loc[i, 'Rid'] == df2.loc[:, 'R_ID'])
    & (df1.loc[i, 'Tid'] == df2.loc[:, 'T_ID'])
    & (df1.loc[i, 'Sid'] == df2.loc[:, 'S_ID'])
)
You might want to include a check to verify that you found one and only one combination:
assert bool_index.sum() == 1, "expected exactly one matching row in df2"
And finally, you can use the boolean index to call the right values from df2:
ABC_for_computation = df2.loc[bool_index, 'ABC']
XYZ_for_computation = df2.loc[bool_index, 'XYZ']
Note that I'm not too sure about the speed of this operation if you have large datasets. In my experience, if speed is affected you should switch to numpy arrays instead of dataframes, particularly when writing data into your dataframe.
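If the row-by-row lookup turns out to be slow, a merge on the four key columns is one alternative. The sketch below is only an illustration under stated assumptions: the question never says what the computation is, so the formula `Dist * ABC / XYZ` summed per key is hypothetical, as are the names `joined` and `result`.

```python
import pandas as pd

# Sample data from the question.
df1 = pd.DataFrame({
    "OpDay": ["18Sep"] * 5,
    "Rid": [1, 1, 1, 1, 1],
    "Tid": [1, 1, 1, 5, 5],
    "Sid": [1, 1, 1, 4, 4],
    "Dist": [10, 15, 20, 5, 50],
})
df2 = pd.DataFrame({
    "S_Day": ["18Sep", "18Sep"],
    "R_ID": [1, 1],
    "T_ID": [1, 5],
    "S_ID": [1, 4],
    "ABC": [100, 125],
    "XYZ": [60, 100],
})

left_keys = ["OpDay", "Rid", "Tid", "Sid"]
right_keys = ["S_Day", "R_ID", "T_ID", "S_ID"]

# Temporary joined view: every df1 row gets its ABC and XYZ; df1 itself is unchanged.
joined = df1.merge(df2, left_on=left_keys, right_on=right_keys, how="left")

# Hypothetical computation, standing in for whatever the real one is.
joined["tmp"] = joined["Dist"] * joined["ABC"] / joined["XYZ"]

# Collapse to one value per key combination and attach it to df2 as RESULT.
result = joined.groupby(right_keys)["tmp"].sum().rename("RESULT").reset_index()
df2 = df2.merge(result, on=right_keys, how="left")
print(df2)
```

The merge does all the lookups in one pass, and only df2 gains a new column.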

How to check dependency of one column to another in a pandas dataframe

I have the following dataframe:
import pandas as pd
df=pd.DataFrame([[1,11,'a'],[1,12,'a'],[1,11,'a'],[1,12,'a'],[1,7,'a'],
[1,12,'a']])
df.columns=['id','code','name']
df
id code name
0 1 11 a
1 1 12 a
2 1 11 a
3 1 12 a
4 1 7 a
5 1 12 a
As shown in the above dataframe, the value of column "id" is directly related to the value of column "name". If I have say, a million records, how can I know that a column is totally dependent on other column in a dataframe?
If they are totally dependent, then their factorizations will be the same:
(df.id.factorize()[0] == df.name.factorize()[0]).all()
True
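The factorization check tests a dependency in both directions at once. If only one direction matters (each value of one column maps to at most one value of another), a groupby-based check is a possible alternative. This is a minimal sketch; the helper `determines` is a hypothetical name of my own, not a pandas function.

```python
import pandas as pd

df = pd.DataFrame(
    [[1, 11, "a"], [1, 12, "a"], [1, 11, "a"], [1, 12, "a"], [1, 7, "a"], [1, 12, "a"]],
    columns=["id", "code", "name"],
)

def determines(frame, src, dst):
    """Return True if each value of `src` maps to at most one distinct value of `dst`."""
    return bool(frame.groupby(src)[dst].nunique().le(1).all())

print(determines(df, "id", "name"))    # id -> name holds in this sample
print(determines(df, "name", "code"))  # name 'a' maps to several codes
```

Note that `nunique` ignores missing values by default, so NaN entries in the dependent column are not counted as a distinct value.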