Take log values on multiple columns DataFrame - pandas

I need to take the log of each element of a column in a DataFrame. I also want to add the resulting column to the existing dataframe.
This is my dataframe:
df1=pd.read_csv('doctors.csv',encoding='latin-1')
These are the columns:
Index(['PatientID', 'Pregnancies', 'PlasmaGlucose', 'DiastolicBloodPressure',
       'TricepsThickness', 'SerumInsulin', 'BMI', 'DiabetesPedigree', 'Age',
       'Diabetic', 'Physician'],
      dtype='object')
I want to form a new column of logarithmic values for 'Age' column.

I believe you need numpy.log10:
import numpy as np
import pandas as pd

df = pd.DataFrame({'Age': [10, 22, 33, 44, 34, 56, 34]})
df['log'] = np.log10(df['Age'])
print (df)
   Age       log
0   10  1.000000
1   22  1.342423
2   33  1.518514
3   44  1.643453
4   34  1.531479
5   56  1.748188
6   34  1.531479
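Since the title mentions multiple columns, the same idea extends to several columns at once. A minimal sketch, assuming the chosen columns ('Age', 'BMI', 'SerumInsulin' from the question's column index) are strictly positive numerics; the '_log' suffix is just an illustrative naming choice:
import numpy as np

cols = ['Age', 'BMI', 'SerumInsulin']           # assumed numeric and > 0
logs = np.log10(df1[cols]).add_suffix('_log')   # np.log10 maps elementwise over the whole subset
df1 = df1.join(logs)                            # attach the new columns to the original frame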

Related

Add/subtract value of a column to the entire column of the dataframe pandas

I have a DataFrame like this, where for column 2 I need to add 0.004 throughout the column to get a 0 value in row 1 of column 2. Similarly, for column 3 I need to subtract 0.4637 from the entire column to get a 0 value in row 1 of column 3. How do I do this efficiently?
Here is my code -
df2 = pd.DataFrame(np.zeros((df.shape[0], len(df.columns)))).round(0).astype(int)
for (i, j) in zip(range(0, 5999), range(1, len(df.columns))):
    if j == 1:
        df2.values[i, j] = df.values[i, j] + df.values[0, 1]
    elif j > 1:
        df2.iloc[i, j] = df.iloc[i, j] - df.iloc[0, j]
print(df2)
Any help would be greatly appreciated. Thank you.
df2 = df - df.iloc[0]
Explanation:
Let's work through an example.
df = pd.DataFrame(np.arange(20).reshape(4, 5))
    0   1   2   3   4
0   0   1   2   3   4
1   5   6   7   8   9
2  10  11  12  13  14
3  15  16  17  18  19
df.iloc[0] selects the first row of the dataframe:
0 0
1 1
2 2
3 3
4 4
Name: 0, dtype: int64
This is a Series. The first column printed here is its index (the column names of the dataframe), and the second column holds the actual values of the first row of the dataframe.
We can convert it to a list to see its values more clearly:
df.iloc[0].tolist()
[0, 1, 2, 3, 4]
Then, using broadcasting, each value is subtracted from the whole column it came from.
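As a minimal check of this broadcasting step (reusing the example df from above):
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(20).reshape(4, 5))
print(df - df.iloc[0])
# row 0 becomes all zeros; every row is shifted by [0, 1, 2, 3, 4]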

Sorting df by column name of type timestamp

I have a dataframe df which consists of columns of countries and rows of dates. The index is of type "DateTime."
I would like to sort the df by each country's value at the last element in the series (e.g., the latest date) and then graph the "top N" countries by this latest value.
I thought if I sorted the transpose of the df and then sliced it, I would have what I need. Hence, if N = 10, then I would select df[0:9].
However, when I attempt to select the last column, I get a 'KeyError' message referencing the selected column:
KeyError: '2021-03-28 00:00:00'.
I'm stumped....
df_T = df.transpose()
column_name = str(df_T.columns[-1])
df_T.sort_values(by = column_name, axis = 'columns', inplace = True)
#select the top 10 countries by latest value, eg
# plot df_T[0:9]
What I'm trying to do, example df:
A B C .... X Y Z
2021-03-29 10 20 5 .... 50 100 7
2021-03-28 9 19 4 .... 45 90 6
2021-03-27 8 15 2 .... 40 80 4
...
2021-01-03 0 0 0 .... 0 0 0
I want to select the series with the greatest N values as of the latest index value (e.g., the latest date).
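The KeyError most likely comes from str(df_T.columns[-1]): after the transpose, the column labels are Timestamps, so the stringified label no longer matches any column key, and axis='columns' would additionally make pandas look for a row label rather than a column. A minimal sketch of a possible fix, assuming the original index is a DatetimeIndex and N = 10:
last_col = df_T.columns[-1]                            # keep the Timestamp label; don't str() it
df_T = df_T.sort_values(by=last_col, ascending=False)  # sort rows by that column (default axis=0)
top_n = df_T.iloc[:10]                                 # top 10 countries by the latest value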

Drop rows not equal to values for unique items - pandas

I've got a df that contains various strings associated with unique values. For each unique value, I want to drop the rows whose string is not in a separate list, except for the last row.
Using the setup below, the various string values in Label are associated with Item. So for each unique Item, there can be multiple rows in Label with various strings. I only want to keep the strings that are in label_list, along with the last row.
I'm not sure I can do this another way, as the number of strings not in label_list is too large to account for. The ordering can also vary. So for each unique value in Item, I really only want the last row and whatever rows are in label_list.
label_list = ['A','B','C','D']
df = pd.DataFrame({
    'Item' : [10,10,10,10,10,20,20,20],
    'Label' : ['A','X','C','D','Y','A','B','X'],
    'Count' : [80.0,80.0,200.0,210.0,260.0,260.0,300.0,310.0],
})
df = df[df['Label'].isin(label_list)]
Intended output:
   Item Label  Count
0    10     A   80.0
1    10     C  200.0
2    10     D  210.0
3    10     Y  260.0
4    20     A  260.0
5    20     B  300.0
6    20     X  310.0
This comes to mind as a quick and dirty solution:
df = pd.concat([df[df['Label'].isin(label_list)],df.drop_duplicates('Item',keep='last')]).drop_duplicates(keep='first')
We append the last row of each Item group, but in case that last row is duplicated because it is also in label_list, we apply drop_duplicates to the concatenated output as well.
Check whether 'Label' is in label_list, check which rows are duplicated (keeping the last per Item), then boolean-slice the dataframe:
isin_ = df['Label'].isin(label_list)
duped = df.duplicated('Item', keep='last')
df[isin_ | ~duped]
   Item Label  Count
0    10     A   80.0
2    10     C  200.0
3    10     D  210.0
4    10     Y  260.0
5    20     A  260.0
6    20     B  300.0
7    20     X  310.0
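An equivalent way to express the "last row per Item" part is a groupby sketch, which some may find more explicit (same result as above):
last_rows = df.groupby('Item').tail(1)   # the last row of each Item group
df[df['Label'].isin(label_list) | df.index.isin(last_rows.index)]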

Sorting rows in the following unique manner (values for columns can be interchanged within the same row, to sort the row)

Input data frame:
   0th col  1st_col  2nd_col
1       23       46        6
2       33       56        3
3      243        2       21
The output data frame should be like:
   0th col  1st_col  2nd_col
1        6       23       46
2        3       33       56
3        2       21      243
Each row has to be sorted in ascending or descending order independently of the columns, meaning values can be interchanged between columns within the same row in order to sort that row.
Please help, I am in the middle of something very important.
Convert the DataFrame to a NumPy array, sort it with np.sort along axis=1, then build a new DataFrame with the constructor:
df1 = pd.DataFrame(np.sort(df.to_numpy(), axis=1),
                   index=df.index,
                   columns=df.columns)
print (df1)
   0th col  1st_col  2nd_col
1        6       23       46
2        3       33       56
3        2       21      243
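If a descending row sort is wanted instead, a small variant of the same idea is to reverse the sorted array along axis 1 (a sketch):
df1 = pd.DataFrame(np.sort(df.to_numpy(), axis=1)[:, ::-1],
                   index=df.index,
                   columns=df.columns)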

Compare two data frames for different values in a column

I have two dataframes. Please tell me how I can compare them by operator name; if a name matches, add the count and time values to the first data frame.
In [2]: df1                        In [3]: df2
Out[2]:                            Out[3]:
     Name  count     time               Name  count     time
0     Bob    123  4:12:10          0    Rick      9  0:13:00
1   Alice     99  1:01:12          1    Jone      7  0:24:21
2  Sergei     78  0:18:01          2     Bob     10  0:15:13
85 rows x 3 columns                105 rows x 3 columns
I want to get:
In [5]: df1
Out[5]:
     Name  count     time
0     Bob    133  4:27:23
1   Alice     99  1:01:12
2  Sergei     78  0:18:01
85 rows x 3 columns
Use set_index and add them together; finally, update back:
df1 = df1.set_index('Name')
df1.update(df1 + df2.set_index('Name'))
df1 = df1.reset_index()
Out[759]:
     Name  count      time
0     Bob  133.0  04:27:23
1   Alice   99.0  01:01:12
2  Sergei   78.0  00:18:01
Note: I assume the time columns in both df1 and df2 are already in a proper timedelta format. If they are strings, you need to convert them before running the above commands, as follows:
df1.time = pd.to_timedelta(df1.time)
df2.time = pd.to_timedelta(df2.time)
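For reference, a minimal end-to-end sketch with toy data matching the printed frames above (values taken from the example; only the three shown rows are used):
import pandas as pd

df1 = pd.DataFrame({'Name': ['Bob', 'Alice', 'Sergei'],
                    'count': [123, 99, 78],
                    'time': ['4:12:10', '1:01:12', '0:18:01']})
df2 = pd.DataFrame({'Name': ['Rick', 'Jone', 'Bob'],
                    'count': [9, 7, 10],
                    'time': ['0:13:00', '0:24:21', '0:15:13']})

df1.time = pd.to_timedelta(df1.time)
df2.time = pd.to_timedelta(df2.time)

df1 = df1.set_index('Name')
df1.update(df1 + df2.set_index('Name'))   # non-matching names produce NaN, which update() skips
df1 = df1.reset_index()
print(df1)                                # Bob: count 133.0, time 0 days 04:27:23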