Frequency Table from All DataFrame Data

Frequency Table from All DataFrame Data - pandas

Want to generate frequency table from all values in DataFrame. I do not want the values from the index and index can be destroyed.
Sample data:
col_list = ['ob1','ob2','ob3','ob4', 'ob5']
df = pd.DataFrame(np.random.uniform(73.965,74.03,size=(25, 5)).astype(float), columns=col_list)
My attempt based off this answer:
my_bins = [i for i in np.arange(73.965, 74.030, 0.005)]
df2 = df.apply(pd.Series.value_counts, bins=my_bins)
Code crashes, can't find another example that does what I'm trying.
Desired out put is a frequency table with counts for all values in bins. Something like this:
data_range
Frequency
73.965<=73.97
1
73.97<=73.975
0
73.98<=73.985
3
73.99<=73.995
2
And so on.

Your approach/code works fine with me.
my_bins = [i for i in np.arange(73.965, 74.030, 0.005)]
out1 = (
df.apply(pd.Series.value_counts, bins=my_bins)
.sum(axis=1).reset_index()
.set_axis(['data_range', 'Frequency'], axis=1)
)
#32.6 ms ± 803 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
Here is different approach (using cut) that seems to be ~12x faster than apply.
my_bins = np.arange(73.965, 74.030, 0.005)

labels = [f"{np.around(l, 3)}<={np.around(r, 3)}"
for l, r in zip(my_bins[:-1], my_bins[1:])]

out2 = (
pd.Series(pd.cut(df.to_numpy().flatten(),
my_bins, labels=labels))
.value_counts(sort=False).reset_index()
.set_axis(['data_range', 'Frequency'], axis=1)
)
#2.42 ms ± 45.4 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Output :
print(out2)
data_range Frequency
0 73.965<=73.97 16
1 73.97<=73.975 0
2 73.975<=73.98 15
3 73.98<=73.985 12
4 73.985<=73.99 7
.. ... ...
7 74.0<=74.005 8
8 74.005<=74.01 9
9 74.01<=74.015 7
10 74.015<=74.02 7
11 74.02<=74.025 11
[12 rows x 2 columns]

Related

Pandas row sum for values > 0

I have a dataframe of the following format
ROW Value1 Value2 Value3 Value4
1 10 10 -5 -2
2 50 20 -10 -7
3 10 5 0 -1
I am looking to calculate for each row the sum of positive totals and sum of negative totals. So essentially, the resulting frame should look like
ROW Post_Total Neg_Total
1 20 -7
2 70 -17
3 15 -1
One thing I have in my dataset, a column can have only positive or negative values.
Any ideas on how this can be done. I tried subsetting by >0 but was not successful.
Thanks!

Since all columns can either have all positive or all negative, you can use all() to check for condition along the columns, then groupby:
df.groupby(df.gt(0).all(), axis=1).sum()
Output:
False True
ROW
1 -7 20
2 -17 70
3 -1 15
In general, I'll just subset/clip and sum:
out = pd.DataFrame({'pos': df.clip(lower=0).sum(1),
'neg': df.clip(upper=0).sum(1)
})

Use DataFrame.melt, but if performance is important better are another solutions ;):
df1 = (df.melt('ROW')
.assign(g = lambda x: np.where(x['value'].gt(0),'Pos_Total','Neg_Total'))
.pivot_table(index='ROW',columns='g', values='value', aggfunc='sum', fill_value=0)
.reset_index()
.rename_axis(None, axis=1))
print (df1)
ROW Neg_Total Pos_Total
0 1 -7 20
1 2 -17 70
2 3 -1 15
Numpy alternative with numpy.clip:
a = df.set_index('ROW').to_numpy()
df = pd.DataFrame({'Pos_Total': np.sum(np.clip(a, a_min=0, a_max=None), 1),
'Neg_Total': np.sum(np.clip(a, a_min=None, a_max=0), 1)},
index=df['ROW'])

You could use:
(df.melt(id_vars='ROW')
.assign(sign=lambda d: np.where(d['value'].gt(0), 'Pos_Total', 'Neg_Total'))
.groupby(['ROW', 'sign'])['value'].sum()
.unstack('sign')
)
Or alternatively, using masks.
numpy version (faster):
import numpy as np
a = df.set_index('ROW').values
mask = a > 0
pd.DataFrame({'Pos_Total': np.where(mask, a, 0).sum(1),
'Neg_Total': np.where(mask, 0, a).sum(1)})
pandas version (slower than numpy but faster than melt):
d = df.set_index('ROW')
mask = d.gt(0)
pd.DataFrame({'Pos_Total': d.where(mask).sum(1),
'Neg_Total': d.mask(mask).sum(1)},
index=df['ROW'])
output:
Pos_Total Neg_Total
ROW
1 20.0 -7.0
2 70.0 -17.0
3 15.0 -1.0

Let us try apply
out = df.set_index('ROW').apply(lambda x : {'Pos':x[x>0].sum(),'Neg':x[x<0].sum()} ,
result_type = 'expand',
axis=1)
Out[33]:
Pos Neg
ROW
1 20 -7
2 70 -17
3 15 -1

Timing of all answer in order or speed. Computed with timeit on 30k rows with unique ROW values.
# #mozway+jezrael (numpy mask v2)
940 µs ± 10.3 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
# #mozway (numpy mask):
1.29 ms ± 26.5 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
# #Quang Hoang (groupby)
4.68 ms ± 184 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
# #Quang Hoang (clip)
5.2 ms ± 91 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
# #mozway (pandas mask)
10.5 ms ± 612 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
# #mozway (melt+groupby)
36.2 ms ± 1.14 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
# #jezrael (melt+pivot_table)
48.5 ms ± 740 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
# #BENY (apply)
9.05 s ± 76.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
setup:
df = pd.DataFrame({'ROW': [1, 2, 3],
'Value1': [10, 50, 10],
'Value2': [10, 20, 5],
'Value3': [-5, -10, 0],
'Value4': [-2, -7, -1]})
df = pd.concat([df]*10000, ignore_index=True)
df['ROW'] = range(len(df))

Create new dataframe column with isoweekday from datetime

I have this dataframe and I want to make a new column for which day of the week the collisions were on.
collision_date
0 2020-03-14
1 2020-07-26
2 2009-02-03
3 2009-02-28
4 2009-02-09
I have tried variations of this but nothing works.
df["day of the week"] = df["collision_date"].isoweekday()
df["day of the week"] = df["collision_date"].apply(isoweekday)

Assuming collision_date is datetime we can use dt.weekday (+1 to match isoweekday returning 1-7 instead of 0-6):
# Convert If needed
df['collision_date'] = pd.to_datetime(df['collision_date'])
# Turn into Number
df['day of week'] = df['collision_date'].dt.weekday + 1
The slower option with apply is to call isoweekday per date:
from datetime import date
# Convert If needed
df['collision_date'] = pd.to_datetime(df['collision_date'])
# Call isoweekday per date
df['day of week'] = df['collision_date'].apply(date.isoweekday)
df:
collision_date day of week
0 2020-03-14 6
1 2020-07-26 7
2 2009-02-03 2
3 2009-02-28 6
4 2009-02-09 1
Timing Information via timeit:
Sample Data with 1000 rows:
df = pd.DataFrame({
'collision_date': pd.date_range(start='now', periods=1000, freq='D')
})
dt.weekday:
%timeit df['collision_date'].dt.weekday + 1
261 µs ± 1.91 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
apply:
%timeit df['collision_date'].apply(date.isoweekday)
2.53 ms ± 90.7 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Number of months between two dates while one date is given

Input df
Date1
2019-01-23
2020-02-01
note: The type of Date1 is datetime64[ns].
Goal
I want to calculate month diff between Date1 column and '2019-01-01'.
Try and Ref
I try the answers from this post , but it failed as below:
df['Date1'].dt.to_period('M') - pd.to_datetime('2019-01-01').to_period('M')
note:pandas version: 1.1.5

Your solution should be changed by convert periods to integers and for second value is used one element list ['2019-01-01']:
df['new'] = (df['Date1'].dt.to_period('M').astype(int) -
pd.to_datetime(['2019-01-01']).to_period('M').astype(int))
print (df)
Date1 new
0 2019-01-23 0
1 2020-02-01 13
If compare solutions:
rng = pd.date_range('1900-04-03', periods=3000, freq='MS')
df = pd.DataFrame({'Date1': rng})
In [106]: %%timeit
...: date_ref = pd.to_datetime('2019-01-01')
...: df["mo_since_2019_01"] = (df.Date1.dt.year - date_ref.year).values*12 + (df.Date1.dt.month - date_ref.month)
...:
1.57 ms ± 8.18 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [107]: %%timeit
...: df['new'] = (df['Date1'].dt.to_period('M').astype(int) - pd.to_datetime(['2019-01-01']).to_period('M').astype(int))
...:
1.32 ms ± 19.2 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Apply are loops under the hood, so slowier:
In [109]: %%timeit
...: start = pd.to_datetime("2019-01-01")
...: df['relative_months'] = df['Date1'].apply(lambda end: relative_months(start, end, freq="M"))
...:
25.7 s ± 729 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [110]: %%timeit
...: rd = df['Date1'].apply(lambda x:relativedelta(x,date(2019,1,1)))
...: mon = rd.apply(lambda x: ((x.years * 12) + x.months))
...: df['Diff'] = mon
...:
94.2 ms ± 431 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

I think this should work:
date_ref = pd.to_datetime('2019-01-01')
df["mo_since_2019_01"] = (df.Date1.dt.year - date_ref.year).values*12 + (df.Date1.dt.month - date_ref.month)
month_delta = (date2.year - date1.year)*12 + (date2.month - date1.month)
output:
Date1 mo_since_2019_01
0 2019-01-23 0
1 2020-02-01 13

With this solution, you won't need further imports as it simply calculates the length of the pd.date_range() between your fixed start date and varying end date:
def relative_months(start, end, freq="M"):
if start < end:
x = len(pd.date_range(start=start,end=end,freq=freq))
else:
x = - len(pd.date_range(start=end,end=start,freq=freq))
return x
start = pd.to_datetime("2019-01-01")
df['relative_months'] = df['Date1'].apply(lambda end: relative_months(start, end, freq="M"))
In your specific case, I think anon01's solution should be the quickest/ favorable; my variant however allows the use of generic frequency strings for date offsets like 'M', 'D', … and allows you to specifically handle the edge case of "negative" relative offsets (i.e. what happens if your comparison date is not earlier than all dates in Date1).

Try:
rd = df['Date1'].apply(lambda x:relativedelta(x,date(2019,1,1)))
mon = rd.apply(lambda x: ((x.years * 12) + x.months))
df['Diff'] = mon
Input:
Date1
0 2019-01-23
1 2020-02-01
2 2020-05-01
3 2020-06-01
Output:
Date1 Diff
0 2019-01-23 0
1 2020-02-01 13
2 2020-05-01 16
3 2020-06-01 17

Faster way to change each value in dataframe ACCORDING to original value

I have a dataframe with 30000 columns and 4000 rows. Each cell entry contains an integer. For EVERY entry, I want to multiply the original contents with log(k/m),
where k is the total number of rows ie.4000
and m is the total number of non zero rows for THAT PARTICULAR COLUMN.
My current code uses apply:
for column in df.columns:
m = len(df[column].to_numpy().nonzero())
df[column] = df[column].apply(lambda x: x * np.log10(4000/m))
This takes me hours (????). I hope there is some faster way to do it, anyone have any ideas?
Thanks

First generate sample data:
np.random.seed(123)
df = pd.DataFrame(np.random.rand(4, 5)*500, columns=['A', 'B', 'C', 'D', 'E']).astype(int).replace(range(100, 200), 0)
Result:
A B C D E
0 348 0 0 275 359
1 211 490 342 240 0
2 0 364 219 29 0
3 368 91 87 265 265
Next I define a vector with containing the non-zero column counts:
non_zeros = df.ne(0).sum().values
# Giving me: array([3, 3, 3, 4, 2], dtype=int64)
From there I find the log-factor to each column:
faktor = np.mat(np.log10(len(df)/ non_zeros))
# giving me: matrix([[0.12493874, 0.12493874, 0.12493874, 0. , 0.30103 ]])
Then multiplying each column with it's factor and converting back to DataFrame:
res = np.multiply(np.mat(df), faktor)
df = pd.DataFrame(res)
With this solution you come around the non-tight loops in Python.
Hope it will bring some help.

#Dennis Hansen's answer is good, but if you're still need to iterate over column I would recommend not to use apply in your solution.
a = pd.DataFrame(np.random.rand(10000)) # define an arib. dataframe
a.iloc[5:500] = 0 # set some values to zero
Solution with apply performance:
>> %%timeit
>> b = a.apply(lambda x: x * np.log10(10000/len(a.to_numpy().nonzero())))
1.53 ms ± 3.44 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Solution without apply performance:
>> %%timeit
>> b = a*np.log10(10000/len(a.to_numpy().nonzero()))
849 µs ± 3.74 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

how to iterate items of list in columns of dataframe

Here is my dataframe :
import pandas as pd
df = pd.DataFrame({'animal':['dog','cat','rabbit','pig'],'color':['red','green','blue','purple'],\
'season':['spring,','summer','fall','winter']})
and I have a list
l = ['dog','green','purple']
with these dataframe and list, I wanna add another column to df, which is actually a result if column 'animal' or column 'color' matched some item of l(list).
so, the result(dataframe) I want is below(I wanna express a table):
pd.DataFrame({'animal':['dog','cat','rabbit','pig'],
'color':['red','green','blue','purple'],
'season':['spring,','summer','fall','winter'],
'tar_rm':[1,1,0,1] })
Do I have to iterate list in each rows of column?
I believe one of pandas' advantage is broadcasting but i'm not sure it's possible here...

Use:
cols = ['animal','color']
df['tar_rm'] = df[cols].isin(l).any(axis=1).astype(int)
print (df)
animal color season tar_rm
0 dog red spring 1
1 cat green summer 1
2 rabbit blue fall 0
3 pig purple winter 1
Details:
First compare filtered columns of DataFrame by DataFrame.isin:
print (df[cols].isin(l))
animal color
0 True False
1 False True
2 False False
3 False True
Then test if at least one True per rows by DataFrame.any:
print (df[cols].isin(l).any(axis=1))
0 True
1 True
2 False
3 True
dtype: bool
An last cast boolean to integers:
print (df[cols].isin(l).any(axis=1).astype(int))
0 1
1 1
2 0
3 1
dtype: int32
If performance is important compare by isin each column separately, convert to numpy array, chain by bitwise OR and last cast to integers:
df['tar_rm'] = (df['animal'].isin(l).values | df['color'].isin(l).values).astype(int)
Performance: Depends of number of tows, number of matched rows and number of values of list, so best test in real data:
l = ['dog','green','purple']
df = pd.concat([df] * 100000, ignore_index=True).sample(1)
In [173]: %timeit df['tar_rm'] = df[['animal','color']].isin(l).any(axis=1).astype(int)
2.11 ms ± 250 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [174]: %timeit df['tar_rm'] = (df['animal'].isin(l).values | df['color'].isin(l).values).astype(int)
487 µs ± 9.87 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [175]: %timeit df['tar_rm'] = np.where(df['animal'].isin(l) | df['color'].isin(l), 1, 0)
805 µs ± 15.3 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

using numpy
df['tar_rm'] = np.where(df['animal'].isin(l) | df['color'].isin(l), 1, 0)
Output
animal color season tar_rm
0 dog red spring, 1
1 cat green summer 1
2 rabbit blue fall 0
3 pig purple winter 1

We Keep Coding

sql objective-c vba vb.net react-native apache vue.js tensorflow api pandas

Frequency Table from All DataFrame Data - pandas

Related

Pandas row sum for values > 0

Create new dataframe column with isoweekday from datetime

Number of months between two dates while one date is given

Faster way to change each value in dataframe ACCORDING to original value

how to iterate items of list in columns of dataframe

Categories

Resources