Formatting the data of multiple columns in a dataframe - dataframe

I've data frame that has several columns. A couple of columns have a very large values and I want to convert those into kilo, mega, giga format (eg: 10m, 20g etc). Got a custom function to do it but kind of stuck on how to apply it for selected columns. This is what I tried:
Df:
key avg max
--- --- ----
speed 100000 200000000
...
def convert(x: str):
if int(x) <= 0:
return 0
.. do conversion here
df.apply(apply(lambda x: convert if x.name in ['avg', 'max'] else x)
However, this doesn't seem to be doing the conversion and also giving an error message about the data type. The following works but it has to be done for each field separately.
df['avg'] = df['avg'].apply(func)
Is there a better way to do this?

Yes , you can do it like below:
for cols in df[['avg', 'max']].columns:
df[cols] = df[cols].apply(function)

Related

Aggregating multiple data types in pandas groupby

I have a data frame with rows that are mostly translations of other rows e.g. an English row and an Arabic row. They share an identifier (location_shelfLocator) and I'm trying to merge the rows together based on the identifier match. In some columns the Arabic doesn't contain a translation, but the same English value (e.g. for the language column both records might have ['ger'] which becomes ['ger', 'ger']) so I would like to get rid of these duplicate values. This is my code:
df_merged = df_filled.groupby("location_shelfLocator").agg(
lambda x: np.unique(x.tolist())
)
It works when the values being aggregated are the same type (e.g. when they are both strings or when they are both arrays). When one is a string and the other is an array, it doesn't work. I get this warning:
FutureWarning: ['subject_name_namePart'] did not aggregate successfully. If any error is raised this will raise in a future version of pandas. Drop these columns/ops to avoid this warning.
df_merged = df_filled.groupby("location_shelfLocator").agg(lambda x: np.unique(x.tolist()))
and the offending column is removed from the final data frame. Any idea how I can combine these values and remove duplicates when they are both lists, both strings, or one of each?
Here is some sample data:
location_shelfLocator,language_languageTerm,subject_topic,accessCondition,subject_name_namePart
81055/vdc_100000000094.0x000093,ara,"['فلك، العرب', 'فلك، اليونان', 'فلك، العصور الوسطى', 'الكواكب']",المُلكية العامة,كلاوديوس بطلميوس (بطليمو)
81055/vdc_100000000094.0x000093,ara,"['Astronomy, Arab', 'Astronomy, Greek', 'Astronomy, Medieval', 'Constellations']",Public Domain,"['Claudius Ptolemaeus (Ptolemy)', ""'Abd al-Raḥmān ibn 'Umar Ṣūfī""]"
And expected output:
location_shelfLocator,language_languageTerm,subject_topic,accessCondition,subject_name_namePart
"[‘81055/vdc_100000000094.0x000093’] ",[‘ara’],"['فلك، العرب', 'فلك، اليونان', 'فلك، العصور الوسطى', ‘الكواكب’, 'Astronomy, Arab', 'Astronomy, Greek', 'Astronomy, Medieval', 'Constellations']","[‘المُلكية العامة’, ‘Public Domain’]","[‘كلاوديوس بطلميوس (بطليمو)’,’Claudius Ptolemaeus (Ptolemy)', ""'Abd al-Raḥmān ibn 'Umar Ṣūfī""]"
If you cannot have a control over the input value, you need to fix it somehow.
Something like this. Here, I am converting string value in subject_name_namePart to array of string.
from ast import literal_eval
mask = df.subject_name_namePart.str[0] != '['
df.loc[mask, 'subject_name_namePart'] = "['" + df.loc[mask, 'subject_name_namePart'] + "']"
df['subject_name_namePart'] = df.subject_name_namePart.transform(literal_eval)
Then, you can do (explode) + aggregation.
df = df.explode('subject_name_namePart')
df = df.groupby('location_shelfLocator').agg(lambda x: x.unique().tolist())

Rolling apply lambda function based on condtion

I have a dataframe with normalised (to 100) returns for 18 products (columns). I want to apply a lambda function which multplies the next row by the previous row.
I can do :
df= df.rolling(2).apply(lambda x: (x[0]*x[1]),raw=True)
But some of my columns dont have values on row 1 (they go live on row 4). So I need to either:
Have a lambda function that starts only on row 4 yet applies to the entire df. I can create the first 4 rows manually.
As my values are 100 until "live" I could have the lambda function only applying when the value does not equal 100.
I have tried both :
1.
df.iloc[3:,:] = df.iloc[3:,:].rolling(2).apply(lambda x: (x[0]*x[1]),raw=True)
df= df.rolling(2).apply(lambda x: (x[0]*x[1]) if x[0] != 100 else x,raw=True)
But both meet with total failure.
Any advice welcomed - I've spent hours looking through the site and have yet to find any outcome that works for this situation.
So given the lack of responses I came up with a solution where I split my df in 2 parts and appended it back together.
My lambda function was also garbage I needed something like :
df2 = df.copy()
for i in range(df2.index.size):
if not i:
continue
df2.iloc[i] = (df2.iloc[i - 1] * (df.iloc[i]))
df2
to actually achieve what I was after.

How to show truncated form of large pandas dataframe after style.apply?

Normally, a relatively long dataframe like
df = pd.DataFrame(np.random.randint(0,10,(100,2)))
df
will display a truncated form in jupyter notebook like
With head, tail, ellipsis in between and row column count in the end.
However, after style.apply
def highlight_max(x):
return ['background-color: yellow' if v == x.max() else '' for v in x]
df.style.apply(highlight_max)
we got all rows displayed
Is it possible to still display the truncated form of dataframe after style.apply?
Something simple like this?
def display_df(dataframe, function):
display(dataframe.head().style.apply(function))
display(dataframe.tail().style.apply(function))
print(f'{dataframe.shape[0]} rows x {dataframe.shape[1]} columns')
display_df(df, highlight_max)
Output:
**** EDIT ****
def display_df(dataframe, function):
display(pd.concat([dataframe.iloc[:5,:],
pd.DataFrame(index=['...'], columns=dataframe.columns),
dataframe.iloc[-5:,:]]).style.apply(function))
print(f'{dataframe.shape[0]} rows x {dataframe.shape[1]} columns')
display_df(df, highlight_max)
Output:
The jupyter preview is basically something like this:
def display_df(dataframe):
display(pd.concat([dataframe.iloc[:5,:],
pd.DataFrame(index=['...'], columns=dataframe.columns, data={0: '...', 1: '...'}),
dataframe.iloc[-5:,:]]))
but if you try to apply style you are getting an error (TypeError: '>=' not supported between instances of 'int' and 'str') because it's trying to compare and highlight the string values '...'
You can capture the output in a variable and then use head or tail on it. This gives you more control on what you display every time.
output = df.style.apply(highlight_max)
output.head(10) # 10 -> number of rows to display
If you want to see more variate data you can also use sample, which will get random rows:
output.sample(10)

Manipulating duplicate rows across a subset of columns in dataframe pandas

Suppose I have a dataframe as follows:
df = pd.DataFrame({"user":[11,11,11,21,21,21,21,21,32,32],
"event":[0,0,1,0,0,1,1,1,0,0],
"datetime":['05:29:54','05:32:04','05:32:08',
'15:35:26','15:36:07','15:36:16','15:36:50','15:36:54',
'09:29:12', '09:29:25'] })
I would like to handle the repetitive lines across the first column (user) to reach the following.
In this case, we replace the 'event' column with the maximum value related in the 'user' column (for example for user=11, the maximum value for event is 1). And the third column is replaced by the average of the datetime.
P.S. It has been already discussed about dropping the repetitive rows here, however, I do not want to drop rows blindly. Especially when I am dealing with a dataframe with a lot of attributes.
You want to groupby and aggregate
df.groupby('user').agg({'event': 'max',
'datetime': lambda s: pd.to_timedelta(s).mean()})
If you want, you can also just change your datetime column first to timedelta using pd.to_timedelta and just take the mean in the agg
You can use str to represent the way you intend
df.groupby('user').agg({'event': 'max',
'datetime': lambda s: str(pd.to_timedelta(s).mean().to_pytimedelta())})
You can convert datetimes to native integers and aggregate mean, last convert back and for HH:MM:SS strings use strftime:
df['datetime'] = pd.to_datetime(df['datetime']).astype(np.int64)
df1 = df.groupby('user', as_index=False).agg({'event':'max', 'datetime':'mean'})
df1['datetime'] = pd.to_datetime(df1['datetime']).dt.strftime('%H:%M:%S')
print (df1)
user event datetime
0 11 1 05:31:22
1 21 1 15:36:18
2 32 0 09:29:18

What's the Pandas way to write `if()` conditional between two `timeseries` columns?

My naive approach to Pandas Series needs some pointers. I have one Pandas DataFrame with two joined tables. The left table had timestamp with title Time1 and the right had Time2; My new DataFrame has both.
At this step I'm comparing the two datetime columns using helper functions g() and f():
df['date_error'] = g(df['Time1'], df['Time2'])
The working helper function g() compares two datetime values:
def g(newer,older):
value = newer > older
return value
This gives me a column (True,False) values. When I use the conditional in the helper function f(), I get an error because newer and older are Pandas Series:
def f(newer,older):
if newer > older:
delta = (newer - older)
else :
# arbitrairly large value to maintain col dtype
delta = datetime.timedelta(minutes=1000)
return delta
Ok. Fine. I know I'm not unpacking the Pandas Series correctly, because I can get this to work with the following monstrosity:
def f(newer,older):
delta = []
for (k,v),(k2,v2) in zip(newer.iteritems(), older.iteritems()):
if v > v2 :
delta.append(v - v2)
else :
# arbitrairly large value to maintain col dtype
delta.append(datetime.timedelta(minutes=1000))
return pd.Series(delta)
What's the Pandas way a conditional between two DataFrame columns?
Usually where is the pandas equivalent of if:
df = pd.DataFrame([['1/1/01 11:00', '1/1/01 12:00'],
['1/1/01 14:00', '1/1/01 13:00']],
columns = ['Time1', 'Time2']
).apply(pd.to_datetime)
(df.Time1 - df.Time2).where(df.Time1 > df.Time2)
0 NaT
1 01:00:00
dtype: timedelta64[ns]
If you don't want nulls in this column you could call fillna(1000) afterwards, however note that this datatype supports a null value NaT (not a time).