Pandas retrieve a value based on index - pandas

Been trying to search for this but somehow can't seem to find the right answer.
Given the following simple dataframe:
country continent population
0 UK Europe 111111
1 Spain Europe 222222
2 Malaysia Asia 333333
3 USA America 444444
How can I retrieve the country value when an index value is given? For example, if I am given an index value of 2, I should get back Malaysia.
Edit: Forgot to mention that the input index value comes from a variable (think of it as a user selecting a particular row, with the selected row providing the index value).
Thank you.

df.iloc[2]['country']
iloc is used for selection by position; see the pandas.DataFrame.iloc documentation for further options.

index = 2
print(df.iloc[index]['country'])
Malaysia
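
Note that iloc is strictly positional: with the default 0..n-1 RangeIndex the position and the label coincide, but if the frame has a custom index, the label-based loc accessor is the one to use. A small sketch of the difference, using the example frame above:

index = 2
print(df.iloc[index]['country'])  # by integer position
print(df.loc[index, 'country'])   # by index label (same result here, since the index is 0..n-1)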

Related

End user column sort - PowerBi

Wanted to know if there is a way for the end user to custom sort a column.
For example, we can add a conditional sort column like the one below and use the 'Sort by column' option to custom sort a column. Is there a way to add a parameter in M so the end user can switch between different sort orders for the column? For example, if he
Zone |Sort
North | 1
South | 2
Central | 3
East | 4
West | 5
There is a new Dynamic M query parameters feature in preview in the Power BI March update, but I couldn't make it work.
If it cannot be achieved via parameters, what would be the best approach?

How to separate entries, and count the occurrences

I'm trying to count which country most celebrities come from. However, the csv that I'm working with has multiple countries for a single celeb, e.g. "France, US" for someone with dual nationality.
To count the above, I can use .count() on the entries in the "nationality" column. But I want to count France, US and any other country separately.
I cannot figure out a way to separate all the entries in the column and then count the occurrences.
I want to be able to reorder my dataframe by these counts, so I want to do the counting inside a structure like
data.groupby(by="nationality").count()
This returns faulty counts such as
"France, US" 1
Assuming this type of data:
data = pd.DataFrame({'nationality': ['France','France, US', 'US', 'France']})
nationality
0 France
1 France, US
2 US
3 France
You need to split and explode, then use value_counts to get the sorted counts per country:
out = (data['nationality']
.str.split(', ')
.explode()
.value_counts()
)
Output:
France 3
US 2
Name: nationality, dtype: int64
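
If these counts then need to be fed back into a dataframe (e.g. to merge them onto data or to reorder rows by them), one option is to turn the Series into a regular frame; a sketch reusing the pipeline above:

counts = (data['nationality']
          .str.split(', ')
          .explode()
          .value_counts()
          .rename_axis('nationality')
          .reset_index(name='count'))
print(counts)
#   nationality  count
# 0      France      3
# 1          US      2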

Conditional merge and transformation of data in pandas

I have two data frames, and I want to create new columns in frame 1 using properties from frame 2
frame 1
Name
alice
bob
carol
frame 2
Name Type Value
alice lower 1
alice upper 2
bob equal 42
carol lower 0
desired result
frame 1
Name Lower Upper
alice 1 2
bob 42 42
carol 0 NA
Hence, the common column of both frames is Name. You can use Name to look up the bounds (of a variable), which are specified in the second frame. Frame 1 lists each name exactly once. Frame 2 might have one or two entries per name, each specifying either a lower or an upper bound (or both at once if the type is equal). We do not need both bounds for every variable; one of the bounds can stay empty. I would like a frame that lists the range of each variable. I can see how to do that with for-loops over the columns, but that does not seem to be in the pandas spirit. Do you have any suggestions for a compact solution? :-)
Thanks in advance
You're not looking for a merge, but rather a pivot.
(df2[df2['Name'].isin(df1['Name'])]
   .pivot(index='Name', columns='Type', values='Value')
   .reset_index()
)
But this doesn't handle the special 'equal' case.
For this, you can use a little trick. Replace 'equal' by a list with the other two values and explode to create the two rows.
(df2[df2['Name'].isin(df1['Name'])]
   .assign(Type=lambda d: d['Type'].map(lambda x: {'equal': ['lower', 'upper']}.get(x, x)))
   .explode('Type')
   .pivot(index='Name', columns='Type', values='Value')
   .reset_index()
   .convert_dtypes()
)
Output:
Name lower upper
0 alice 1 2
1 bob 42 42
2 carol 0 <NA>
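
If the result should follow frame 1 row for row (including any name that has no entry at all in frame 2), the pivoted frame can be left-joined back onto it. A sketch, assuming the pipeline above was assigned to a variable named bounds:

result = df1.merge(bounds, on='Name', how='left')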

Drop duplicates based on 2 columns if the value in another column is null - Pandas

If I have a dataframe
Index City Country State
0 Chicago US IL
1 Sacramento US CA
2 Sacramento US
3 Naperville US IL
I want to find rows with duplicate values for 'City' and 'Country' but only drop the row with no entry for 'State'.
I.e. drop row #2.
What is the best way to approach this using Pandas?
Use a boolean mask to get the index of the rows to delete, then use drop with inplace=True to remove these rows:
df.drop(df.loc[(df.duplicated(['City', 'Country'])
                & df['State'].isna())].index, inplace=True)
print(df)
# Output:
City Country State
0 Chicago US IL
1 Sacramento US CA
3 Naperville US IL
Note: the answer from @QuangHoang is the opposite of this one: I drop the bad rows, he keeps the good rows. Honestly, I prefer his method.
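
For reference, the keep-the-good-rows variant mentioned in that note might look like this (a sketch, not the other answer's exact code):

df = df[~(df.duplicated(['City', 'Country']) & df['State'].isna())]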

Compare last row to previous row by group and populate new column

I need to compare the last row of a group to the row above it, see if changes occur in a few columns, and populate a new column with 1 if a change occurs. The data presentation below will explain better.
Also need to account for having a group with only 1 row.
what we have:
Group Name Sport DogName Eligibility
1 Tom BBALL Toto Yes
1 Tom BBall Toto Yes
1 Tom golf spot Yes
2 Nancy vllyball Jimmy yes
2 Nancy vllyball rover no
what we want:
Group Name Sport DogName Eligibility N_change S_change D_Change E_change
1 Tom BBALL Toto Yes 0 0 0 0
1 Tom BBall Toto Yes 0 0 0 0
1 Tom golf spot Yes 0 1 1 0
2 Nancy vllyball Jimmy yes 0 0 0 0
2 Nancy vllyball rover no 0 0 1 1
I only care about changes from row to row within a group. Thank you for any help in advance.
The rows are already ordered, so we only need the last two rows of each group. If it is easier to compare sequential rows within a group, that is just as good for my purposes.
I did know this would involve arrays, and I struggle with those because I never use them in my typical SAS modeling. I wanted to keep things short and sweet.
Use a DATA step with the LAG function. Ensure your data is sorted by group first, and that the rows within each group are in the correct order. Using arrays will make your code much smaller.
The logic below compares each row with the previous row. A flag of 1 is set only if:
It's not the first row of the group, and
the current value differs from the previous value.
The syntax var = (test logic); is a shortcut for generating 0/1 dummy flags.
data want;
    set have;
    by group;

    array var[*] name sport dogname eligibility;
    array lagvar[*] $ lag_name lag_sport lag_dogname lag_eligibility;
    array changeflag[*] N_change S_change D_change E_change;

    do i = 1 to dim(var);
        /* A LAG call shares one queue across all of its executions, so inside this
           4-variable loop LAG4 is needed to reach back to the previous observation */
        lagvar[i] = lag4(var[i]);
        /* NE is case-sensitive; wrap both sides in UPCASE() if e.g. BBALL vs BBall
           should not count as a change */
        changeflag[i] = (var[i] NE lagvar[i] AND NOT first.group);
    end;

    drop lag: i;
run;
It is not uncommon for procedural programmers to run into this kind of dilemma in SQL, which is predominantly a set language where rows have no position. If you write a procedure that reads the selected data (sorted in the desired order), it can use variables to remember the previous row and create the desired additional columns in the output, similar to the LAG function above.
Or you can put the data into a spreadsheet, which is happier detecting changes with formula-filled columns such as =IF(A2<>A1,1,0). Just make sure nobody re-sorts the spreadsheet data into a new order!
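
Since the rest of this page is pandas, here is the same previous-row comparison sketched there as well; this is not part of the SAS answer, just the same idea (group, shift, compare) translated, with the column names taken from the example above:

import pandas as pd

df = pd.DataFrame({
    'Group': [1, 1, 1, 2, 2],
    'Name': ['Tom', 'Tom', 'Tom', 'Nancy', 'Nancy'],
    'Sport': ['BBALL', 'BBall', 'golf', 'vllyball', 'vllyball'],
    'DogName': ['Toto', 'Toto', 'spot', 'Jimmy', 'rover'],
    'Eligibility': ['Yes', 'Yes', 'Yes', 'yes', 'no'],
})

cols = ['Name', 'Sport', 'DogName', 'Eligibility']
cur = df[cols].apply(lambda s: s.str.upper())       # compare case-insensitively, so BBALL == BBall
prev = cur.groupby(df['Group']).shift()             # previous row within each group
flags = ((cur != prev) & prev.notna()).astype(int)  # first row of a group stays 0
flags.columns = ['N_change', 'S_change', 'D_change', 'E_change']

print(pd.concat([df, flags], axis=1))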