Compare two data frames for different values in a column - pandas

I have two dataframes. How can I compare them by operator name, and, where a name matches, add that row's count and time values to the first dataframe?
In [2]: df1
Out[2]:
     Name  count     time
0     Bob    123  4:12:10
1   Alice     99  1:01:12
2  Sergei     78  0:18:01
...
[85 rows x 3 columns]

In [3]: df2
Out[3]:
   Name  count     time
0  Rick      9  0:13:00
1  Jone      7  0:24:21
2   Bob     10  0:15:13
...
[105 rows x 3 columns]
I want to get:
In [5]: df1
Out[5]:
     Name  count     time
0     Bob    133  4:27:23
1   Alice     99  1:01:12
2  Sergei     78  0:18:01
...
[85 rows x 3 columns]

Use set_index on Name and add the frames together. Finally, update the original values and reset the index:
df1 = df1.set_index('Name')
df1.update(df1 + df2.set_index('Name'))
df1 = df1.reset_index()
Output:
     Name  count      time
0     Bob  133.0  04:27:23
1   Alice   99.0  01:01:12
2  Sergei   78.0  00:18:01
Note: I assume the time columns in both df1 and df2 are already in a proper date/time format. If they are strings, convert them before running the above commands, as follows:
df1.time = pd.to_timedelta(df1.time)
df2.time = pd.to_timedelta(df2.time)
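Putting it all together, here is a minimal end-to-end sketch using the sample rows shown above (the full 85- and 105-row frames are assumed to behave the same way):

import pandas as pd

df1 = pd.DataFrame({'Name': ['Bob', 'Alice', 'Sergei'],
                    'count': [123, 99, 78],
                    'time': ['4:12:10', '1:01:12', '0:18:01']})
df2 = pd.DataFrame({'Name': ['Rick', 'Jone', 'Bob'],
                    'count': [9, 7, 10],
                    'time': ['0:13:00', '0:24:21', '0:15:13']})

# convert the string times to timedeltas so they can be added
df1['time'] = pd.to_timedelta(df1['time'])
df2['time'] = pd.to_timedelta(df2['time'])

df1 = df1.set_index('Name')
# names missing from df2 produce NaN in the sum, and update skips NaN,
# so only matching names are overwritten
df1.update(df1 + df2.set_index('Name'))
df1 = df1.reset_index()
print(df1)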

Convert some columns in a dataframe to a column of lists in the dataframe

I would like to convert some of the columns in a dataframe to a single column of lists.
The dataframe, df:
Name salary department days other
0 ben 1000 A 90 abc
1 alex 3000 B 80 gf
2 linn 600 C 55 jgj
3 luke 5000 D 88 gg
The desired output, df1:
Name list other
0 ben [1000,A,90] abc
1 alex [3000,B,80] gf
2 linn [600,C,55] jgj
3 luke [5000,D,88] gg
You can slice and convert the columns to a list of lists, then to a Series:
cols = ['salary', 'department', 'days']
out = (df.drop(columns=cols)
         .join(pd.Series(df[cols].to_numpy().tolist(), name='list', index=df.index))
       )
Output:
Name other list
0 ben abc [1000, A, 90]
1 alex gf [3000, B, 80]
2 linn jgj [600, C, 55]
3 luke gg [5000, D, 88]
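A slightly more readable alternative (a sketch under the same assumptions) builds the list column with apply instead of going through NumPy:

cols = ['salary', 'department', 'days']
# list(row) collects each row's values into a Python list
out = df.drop(columns=cols).join(df[cols].apply(list, axis=1).rename('list'))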
If you want to preserve the column order, we can break it down into 3 parts, as #mozway mentioned in his answer:
Define the columns we want to group.
Find the first grouped column's position (you can take it a step further and find the smallest one, since the list won't necessarily be sorted like the DataFrame).
Insert the Series into the dataframe at the position we found.
cols = ['salary', 'department', 'days']
first_location = df.columns.get_loc(cols[0])
list_values = pd.Series(df[cols].values.tolist()) # converting values to one list
df.insert(loc=first_location, column='list', value=list_values) # inserting the Series in the desired location
df = df.drop(columns=cols) # dropping the columns we grouped together.
print(df)
Which results in:
Name list other
0 ben [1000, A, 90] abc
1 alex [3000, B, 80] gf
...

Sorting df by column name of type timestamp

I have a dataframe df which consists of columns of countries and rows of dates. The index is of type "DateTime."
I would like to sort the df by each country's value at the last element in the series (e.g., the latest date) and then graph the "top N" countries by this latest value.
I thought that if I sorted the transpose of the df and then sliced it, I would have what I need. Hence, if N = 10, I would select df_T[0:10].
However, when I attempt to sort by the last column, I get a KeyError referencing the selected column:
KeyError: '2021-03-28 00:00:00'.
I'm stumped....
df_T = df.transpose()
column_name = str(df_T.columns[-1])
df_T.sort_values(by = column_name, axis = 'columns', inplace = True)
#select the top 10 countries by latest value, eg
# plot df_T[0:10]
What I'm trying to do, example df:
A B C .... X Y Z
2021-03-29 10 20 5 .... 50 100 7
2021-03-28 9 19 4 .... 45 90 6
2021-03-27 8 15 2 .... 40 80 4
...
2021-01-03 0 0 0 .... 0 0 0
I want to select the series with the greatest N values as of the latest index value (e.g., the latest date).
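The KeyError likely has two causes: str(df_T.columns[-1]) turns the Timestamp column label into a string that no longer matches the actual label, and axis='columns' sorts columns by a row label rather than rows by a column. A minimal sketch of one possible fix (assuming matplotlib is available for the plot; N = 10 here):

# select the latest date explicitly rather than relying on row position
latest = df.index.max()
top_n = df.loc[latest].nlargest(10)   # top 10 countries at the latest date
df[top_n.index].plot()                # plot only those columns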

How to apply one-hot encoding or get_dummies on 2 columns together in pandas?

I have the below dataframe, which contains sample values:
df = pd.DataFrame([["London", "Cambridge", 20], ["Cambridge", "London", 10], ["Liverpool", "London", 30]], columns= ["city_1", "city_2", "id"])
city_1 city_2 id
London Cambridge 20
Cambridge London 10
Liverpool London 30
I need the output dataframe below, which is built by joining the 2 city columns together and then applying one-hot encoding:
id London Cambridge Liverpool
20 1 1 0
10 1 1 0
30 1 0 1
Currently, I am using the below code, which works on one column at a time. Could you please advise if there is a pythonic way to get the above output?
output_df = pd.get_dummies(df, columns=['city_1', 'city_2'])
which results in
id  city_1_Cambridge  city_1_London  (and so on)
You can add the parameters prefix_sep and prefix to get_dummies, and then use max if you want only 1 or 0 values (dummy/indicator columns), or sum if you need to count the 1 values:
output_df = (pd.get_dummies(df, columns=['city_1', 'city_2'], prefix_sep='', prefix='')
               .max(axis=1, level=0))
print (output_df)
id Cambridge Liverpool London
0 20 1 0 1
1 10 1 0 1
2 30 0 1 1
Or, if you want to process all columns except id, first move the column(s) to skip into the index with DataFrame.set_index, then use get_dummies with max, and finally add DataFrame.reset_index:
output_df = (pd.get_dummies(df.set_index('id'), prefix_sep='', prefix='')
               .max(axis=1, level=0)
               .reset_index())
print (output_df)
id Cambridge Liverpool London
0 20 1 0 1
1 10 1 0 1
2 30 0 1 1
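Note: .max(axis=1, level=0) relies on the level keyword in aggregations, which was deprecated in pandas 1.3 and removed in 2.0. On recent versions, a sketch of an equivalent approach is to group the transposed frame by its duplicated labels:

dummies = pd.get_dummies(df.set_index('id'), prefix_sep='', prefix='')
output_df = (dummies.T.groupby(level=0).max().T
                     .astype(int)      # recent pandas creates boolean dummies; cast back to 0/1
                     .reset_index())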

Take log values on multiple columns DataFrame

I need to take the log of each element of a column in a DataFrame. I also want to add the resulting column to the existing dataframe.
This is my dataframe
df1=pd.read_csv('doctors.csv',encoding='latin-1')
These are the columns
Index(['PatientID', 'Pregnancies', 'PlasmaGlucose', 'DiastolicBloodPressure',
'TricepsThickness', 'SerumInsulin', 'BMI', 'DiabetesPedigree', 'Age',
'Diabetic', 'Physician'],
dtype='object')
I want to form a new column of logarithmic values for the 'Age' column.
I believe you need numpy.log10:
import numpy as np
import pandas as pd

df = pd.DataFrame({'Age':[10,22,33,44,34,56,34]})
df['log'] = np.log10(df['Age'])
print(df)
Age log
0 10 1.000000
1 22 1.342423
2 33 1.518514
3 44 1.643453
4 34 1.531479
5 56 1.748188
6 34 1.531479
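Applied to the doctors dataframe from the question, that would be (log_Age is just an arbitrary name for the new column):

import numpy as np

# base-10 log of the Age column; use np.log instead for the natural log
df1['log_Age'] = np.log10(df1['Age'])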

Converting column names into the first row

I would like to convert the following dataframe into JSON.
df:
              A  sector      B  sector      C      sector
TTM Ratio    --   35.99  12.70   20.63  14.75       23.06
RRM Sales    --  114.57   1.51    5.02   1.00     4594.13
MQR book   1.48    2.64   1.02    2.46   2.73        2.74
TTR cash     --   14.33   7.41   15.35   8.59   513854.86
In order to do so using df.to_json(), I would need unique names in the columns and index. Therefore, what I am looking for is to convert the column names into a row and have default column numbers instead. In short, I would like the following output:
df:
              0       1      2       3      4           5
              A  sector      B  sector      C      sector
TTM Ratio    --   35.99  12.70   20.63  14.75       23.06
RRM Sales    --  114.57   1.51    5.02   1.00     4594.13
MQR book   1.48    2.64   1.02    2.46   2.73        2.74
TTR cash     --   14.33   7.41   15.35   8.59   513854.86
Turning the column names into the first row would let me make the conversion correctly.
You could also use vstack in numpy:
>>> df
x y z
0 8 7 6
1 6 5 4
>>> pd.DataFrame(np.vstack([df.columns, df]))
0 1 2
0 x y z
1 8 7 6
2 6 5 4
The columns become the actual first row in this case.
Use assignment with a list built from a range and the original column names:
print (range(len(df.columns)))
range(0, 6)
#in Python 2, the list() call can be omitted
df.columns = [list(range(len(df.columns))), df.columns]
Or MultiIndex.from_arrays:
df.columns = pd.MultiIndex.from_arrays([range(len(df.columns)), df.columns])
It is also possible to use a RangeIndex:
print (pd.RangeIndex(len(df.columns)))
RangeIndex(start=0, stop=6, step=1)
df.columns = pd.MultiIndex.from_arrays([pd.RangeIndex(len(df.columns)), df.columns])
print (df)
              0       1      2       3      4           5
              A  sector      B  sector      C      sector
TTM Ratio    --   35.99  12.70   20.63  14.75       23.06
RRM Sales    --  114.57   1.51    5.02   1.00     4594.13
MQR book   1.48    2.64   1.02    2.46   2.73        2.74
TTR cash     --   14.33   7.41   15.35   8.59   513854.86
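With unique labels in place, the JSON conversion should no longer complain. A minimal sketch using the vstack variant, which leaves plain integer labels on the columns (note that vstack drops the original row labels; call df.reset_index() first if they should be kept as a column):

import numpy as np

flat = pd.DataFrame(np.vstack([df.columns, df]))  # header becomes row 0, columns are 0..5
json_str = flat.to_json()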