Want to generate frequency table from all values in DataFrame. I do not want the values from the index and index can be destroyed.
Sample data:
col_list = ['ob1','ob2','ob3','ob4', 'ob5']
df = pd.DataFrame(np.random.uniform(73.965,74.03,size=(25, 5)).astype(float), columns=col_list)
My attempt based off this answer:
my_bins = [i for i in np.arange(73.965, 74.030, 0.005)]
df2 = df.apply(pd.Series.value_counts, bins=my_bins)
Code crashes, can't find another example that does what I'm trying.
Desired out put is a frequency table with counts for all values in bins. Something like this:
data_range
Frequency
73.965<=73.97
1
73.97<=73.975
0
73.98<=73.985
3
73.99<=73.995
2
And so on.
Your approach/code works fine with me.
my_bins = [i for i in np.arange(73.965, 74.030, 0.005)]
out1 = (
df.apply(pd.Series.value_counts, bins=my_bins)
.sum(axis=1).reset_index()
.set_axis(['data_range', 'Frequency'], axis=1)
)
#32.6 ms ± 803 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
Here is different approach (using cut) that seems to be ~12x faster than apply.
my_bins = np.arange(73.965, 74.030, 0.005)
labels = [f"{np.around(l, 3)}<={np.around(r, 3)}"
for l, r in zip(my_bins[:-1], my_bins[1:])]
out2 = (
pd.Series(pd.cut(df.to_numpy().flatten(),
my_bins, labels=labels))
.value_counts(sort=False).reset_index()
.set_axis(['data_range', 'Frequency'], axis=1)
)
#2.42 ms ± 45.4 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Output :
print(out2)
data_range Frequency
0 73.965<=73.97 16
1 73.97<=73.975 0
2 73.975<=73.98 15
3 73.98<=73.985 12
4 73.985<=73.99 7
.. ... ...
7 74.0<=74.005 8
8 74.005<=74.01 9
9 74.01<=74.015 7
10 74.015<=74.02 7
11 74.02<=74.025 11
[12 rows x 2 columns]
Related
I have a dataframe of the following format
ROW Value1 Value2 Value3 Value4
1 10 10 -5 -2
2 50 20 -10 -7
3 10 5 0 -1
I am looking to calculate for each row the sum of positive totals and sum of negative totals. So essentially, the resulting frame should look like
ROW Post_Total Neg_Total
1 20 -7
2 70 -17
3 15 -1
One thing I have in my dataset, a column can have only positive or negative values.
Any ideas on how this can be done. I tried subsetting by >0 but was not successful.
Thanks!
Since all columns can either have all positive or all negative, you can use all() to check for condition along the columns, then groupby:
df.groupby(df.gt(0).all(), axis=1).sum()
Output:
False True
ROW
1 -7 20
2 -17 70
3 -1 15
In general, I'll just subset/clip and sum:
out = pd.DataFrame({'pos': df.clip(lower=0).sum(1),
'neg': df.clip(upper=0).sum(1)
})
Use DataFrame.melt, but if performance is important better are another solutions ;):
df1 = (df.melt('ROW')
.assign(g = lambda x: np.where(x['value'].gt(0),'Pos_Total','Neg_Total'))
.pivot_table(index='ROW',columns='g', values='value', aggfunc='sum', fill_value=0)
.reset_index()
.rename_axis(None, axis=1))
print (df1)
ROW Neg_Total Pos_Total
0 1 -7 20
1 2 -17 70
2 3 -1 15
Numpy alternative with numpy.clip:
a = df.set_index('ROW').to_numpy()
df = pd.DataFrame({'Pos_Total': np.sum(np.clip(a, a_min=0, a_max=None), 1),
'Neg_Total': np.sum(np.clip(a, a_min=None, a_max=0), 1)},
index=df['ROW'])
You could use:
(df.melt(id_vars='ROW')
.assign(sign=lambda d: np.where(d['value'].gt(0), 'Pos_Total', 'Neg_Total'))
.groupby(['ROW', 'sign'])['value'].sum()
.unstack('sign')
)
Or alternatively, using masks.
numpy version (faster):
import numpy as np
a = df.set_index('ROW').values
mask = a > 0
pd.DataFrame({'Pos_Total': np.where(mask, a, 0).sum(1),
'Neg_Total': np.where(mask, 0, a).sum(1)})
pandas version (slower than numpy but faster than melt):
d = df.set_index('ROW')
mask = d.gt(0)
pd.DataFrame({'Pos_Total': d.where(mask).sum(1),
'Neg_Total': d.mask(mask).sum(1)},
index=df['ROW'])
output:
Pos_Total Neg_Total
ROW
1 20.0 -7.0
2 70.0 -17.0
3 15.0 -1.0
Let us try apply
out = df.set_index('ROW').apply(lambda x : {'Pos':x[x>0].sum(),'Neg':x[x<0].sum()} ,
result_type = 'expand',
axis=1)
Out[33]:
Pos Neg
ROW
1 20 -7
2 70 -17
3 15 -1
Timing of all answer in order or speed. Computed with timeit on 30k rows with unique ROW values.
# #mozway+jezrael (numpy mask v2)
940 µs ± 10.3 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
# #mozway (numpy mask):
1.29 ms ± 26.5 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
# #Quang Hoang (groupby)
4.68 ms ± 184 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
# #Quang Hoang (clip)
5.2 ms ± 91 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
# #mozway (pandas mask)
10.5 ms ± 612 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
# #mozway (melt+groupby)
36.2 ms ± 1.14 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
# #jezrael (melt+pivot_table)
48.5 ms ± 740 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
# #BENY (apply)
9.05 s ± 76.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
setup:
df = pd.DataFrame({'ROW': [1, 2, 3],
'Value1': [10, 50, 10],
'Value2': [10, 20, 5],
'Value3': [-5, -10, 0],
'Value4': [-2, -7, -1]})
df = pd.concat([df]*10000, ignore_index=True)
df['ROW'] = range(len(df))
I have this dataframe and I want to make a new column for which day of the week the collisions were on.
collision_date
0 2020-03-14
1 2020-07-26
2 2009-02-03
3 2009-02-28
4 2009-02-09
I have tried variations of this but nothing works.
df["day of the week"] = df["collision_date"].isoweekday()
df["day of the week"] = df["collision_date"].apply(isoweekday)
Assuming collision_date is datetime we can use dt.weekday (+1 to match isoweekday returning 1-7 instead of 0-6):
# Convert If needed
df['collision_date'] = pd.to_datetime(df['collision_date'])
# Turn into Number
df['day of week'] = df['collision_date'].dt.weekday + 1
The slower option with apply is to call isoweekday per date:
from datetime import date
# Convert If needed
df['collision_date'] = pd.to_datetime(df['collision_date'])
# Call isoweekday per date
df['day of week'] = df['collision_date'].apply(date.isoweekday)
df:
collision_date day of week
0 2020-03-14 6
1 2020-07-26 7
2 2009-02-03 2
3 2009-02-28 6
4 2009-02-09 1
Timing Information via timeit:
Sample Data with 1000 rows:
df = pd.DataFrame({
'collision_date': pd.date_range(start='now', periods=1000, freq='D')
})
dt.weekday:
%timeit df['collision_date'].dt.weekday + 1
261 µs ± 1.91 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
apply:
%timeit df['collision_date'].apply(date.isoweekday)
2.53 ms ± 90.7 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Input df
Date1
2019-01-23
2020-02-01
note: The type of Date1 is datetime64[ns].
Goal
I want to calculate month diff between Date1 column and '2019-01-01'.
Try and Ref
I try the answers from this post , but it failed as below:
df['Date1'].dt.to_period('M') - pd.to_datetime('2019-01-01').to_period('M')
note:pandas version: 1.1.5
Your solution should be changed by convert periods to integers and for second value is used one element list ['2019-01-01']:
df['new'] = (df['Date1'].dt.to_period('M').astype(int) -
pd.to_datetime(['2019-01-01']).to_period('M').astype(int))
print (df)
Date1 new
0 2019-01-23 0
1 2020-02-01 13
If compare solutions:
rng = pd.date_range('1900-04-03', periods=3000, freq='MS')
df = pd.DataFrame({'Date1': rng})
In [106]: %%timeit
...: date_ref = pd.to_datetime('2019-01-01')
...: df["mo_since_2019_01"] = (df.Date1.dt.year - date_ref.year).values*12 + (df.Date1.dt.month - date_ref.month)
...:
1.57 ms ± 8.18 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [107]: %%timeit
...: df['new'] = (df['Date1'].dt.to_period('M').astype(int) - pd.to_datetime(['2019-01-01']).to_period('M').astype(int))
...:
1.32 ms ± 19.2 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Apply are loops under the hood, so slowier:
In [109]: %%timeit
...: start = pd.to_datetime("2019-01-01")
...: df['relative_months'] = df['Date1'].apply(lambda end: relative_months(start, end, freq="M"))
...:
25.7 s ± 729 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [110]: %%timeit
...: rd = df['Date1'].apply(lambda x:relativedelta(x,date(2019,1,1)))
...: mon = rd.apply(lambda x: ((x.years * 12) + x.months))
...: df['Diff'] = mon
...:
94.2 ms ± 431 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
I think this should work:
date_ref = pd.to_datetime('2019-01-01')
df["mo_since_2019_01"] = (df.Date1.dt.year - date_ref.year).values*12 + (df.Date1.dt.month - date_ref.month)
month_delta = (date2.year - date1.year)*12 + (date2.month - date1.month)
output:
Date1 mo_since_2019_01
0 2019-01-23 0
1 2020-02-01 13
With this solution, you won't need further imports as it simply calculates the length of the pd.date_range() between your fixed start date and varying end date:
def relative_months(start, end, freq="M"):
if start < end:
x = len(pd.date_range(start=start,end=end,freq=freq))
else:
x = - len(pd.date_range(start=end,end=start,freq=freq))
return x
start = pd.to_datetime("2019-01-01")
df['relative_months'] = df['Date1'].apply(lambda end: relative_months(start, end, freq="M"))
In your specific case, I think anon01's solution should be the quickest/ favorable; my variant however allows the use of generic frequency strings for date offsets like 'M', 'D', … and allows you to specifically handle the edge case of "negative" relative offsets (i.e. what happens if your comparison date is not earlier than all dates in Date1).
Try:
rd = df['Date1'].apply(lambda x:relativedelta(x,date(2019,1,1)))
mon = rd.apply(lambda x: ((x.years * 12) + x.months))
df['Diff'] = mon
Input:
Date1
0 2019-01-23
1 2020-02-01
2 2020-05-01
3 2020-06-01
Output:
Date1 Diff
0 2019-01-23 0
1 2020-02-01 13
2 2020-05-01 16
3 2020-06-01 17
I have a dataframe with 30000 columns and 4000 rows. Each cell entry contains an integer. For EVERY entry, I want to multiply the original contents with log(k/m),
where k is the total number of rows ie.4000
and m is the total number of non zero rows for THAT PARTICULAR COLUMN.
My current code uses apply:
for column in df.columns:
m = len(df[column].to_numpy().nonzero())
df[column] = df[column].apply(lambda x: x * np.log10(4000/m))
This takes me hours (????). I hope there is some faster way to do it, anyone have any ideas?
Thanks
First generate sample data:
np.random.seed(123)
df = pd.DataFrame(np.random.rand(4, 5)*500, columns=['A', 'B', 'C', 'D', 'E']).astype(int).replace(range(100, 200), 0)
Result:
A B C D E
0 348 0 0 275 359
1 211 490 342 240 0
2 0 364 219 29 0
3 368 91 87 265 265
Next I define a vector with containing the non-zero column counts:
non_zeros = df.ne(0).sum().values
# Giving me: array([3, 3, 3, 4, 2], dtype=int64)
From there I find the log-factor to each column:
faktor = np.mat(np.log10(len(df)/ non_zeros))
# giving me: matrix([[0.12493874, 0.12493874, 0.12493874, 0. , 0.30103 ]])
Then multiplying each column with it's factor and converting back to DataFrame:
res = np.multiply(np.mat(df), faktor)
df = pd.DataFrame(res)
With this solution you come around the non-tight loops in Python.
Hope it will bring some help.
#Dennis Hansen's answer is good, but if you're still need to iterate over column I would recommend not to use apply in your solution.
a = pd.DataFrame(np.random.rand(10000)) # define an arib. dataframe
a.iloc[5:500] = 0 # set some values to zero
Solution with apply performance:
>> %%timeit
>> b = a.apply(lambda x: x * np.log10(10000/len(a.to_numpy().nonzero())))
1.53 ms ± 3.44 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Solution without apply performance:
>> %%timeit
>> b = a*np.log10(10000/len(a.to_numpy().nonzero()))
849 µs ± 3.74 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Here is my dataframe :
import pandas as pd
df = pd.DataFrame({'animal':['dog','cat','rabbit','pig'],'color':['red','green','blue','purple'],\
'season':['spring,','summer','fall','winter']})
and I have a list
l = ['dog','green','purple']
with these dataframe and list, I wanna add another column to df, which is actually a result if column 'animal' or column 'color' matched some item of l(list).
so, the result(dataframe) I want is below(I wanna express a table):
pd.DataFrame({'animal':['dog','cat','rabbit','pig'],
'color':['red','green','blue','purple'],
'season':['spring,','summer','fall','winter'],
'tar_rm':[1,1,0,1] })
Do I have to iterate list in each rows of column?
I believe one of pandas' advantage is broadcasting but i'm not sure it's possible here...
Use:
cols = ['animal','color']
df['tar_rm'] = df[cols].isin(l).any(axis=1).astype(int)
print (df)
animal color season tar_rm
0 dog red spring 1
1 cat green summer 1
2 rabbit blue fall 0
3 pig purple winter 1
Details:
First compare filtered columns of DataFrame by DataFrame.isin:
print (df[cols].isin(l))
animal color
0 True False
1 False True
2 False False
3 False True
Then test if at least one True per rows by DataFrame.any:
print (df[cols].isin(l).any(axis=1))
0 True
1 True
2 False
3 True
dtype: bool
An last cast boolean to integers:
print (df[cols].isin(l).any(axis=1).astype(int))
0 1
1 1
2 0
3 1
dtype: int32
If performance is important compare by isin each column separately, convert to numpy array, chain by bitwise OR and last cast to integers:
df['tar_rm'] = (df['animal'].isin(l).values | df['color'].isin(l).values).astype(int)
Performance: Depends of number of tows, number of matched rows and number of values of list, so best test in real data:
l = ['dog','green','purple']
df = pd.concat([df] * 100000, ignore_index=True).sample(1)
In [173]: %timeit df['tar_rm'] = df[['animal','color']].isin(l).any(axis=1).astype(int)
2.11 ms ± 250 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [174]: %timeit df['tar_rm'] = (df['animal'].isin(l).values | df['color'].isin(l).values).astype(int)
487 µs ± 9.87 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [175]: %timeit df['tar_rm'] = np.where(df['animal'].isin(l) | df['color'].isin(l), 1, 0)
805 µs ± 15.3 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
using numpy
df['tar_rm'] = np.where(df['animal'].isin(l) | df['color'].isin(l), 1, 0)
Output
animal color season tar_rm
0 dog red spring, 1
1 cat green summer 1
2 rabbit blue fall 0
3 pig purple winter 1