I have a dataframe with two integer columns that represent the start and end of a string of text. I'd like to group my rows by the length of the text (end - start), but with a margin of error of ±5 characters, so that these rows would fall into two groups (length ≈ 250 and length ≈ 500):
start  end
0      251
1      250
2      250

0      500
1      500
0      499
How would I achieve something like this?
Here is the code I am using right now:
import pandas as pd

d = {'text': ["aaa", "bbb", "ccc", "ddd", "eee", "fff"],
     'start': [0, 1, 0, 2, 1, 0],
     'end': [250, 500, 501, 251, 249, 499]}
df = pd.DataFrame(data=d)
df = df.groupby(['start', 'end'])  # only groups exact (start, end) pairs
I ended up solving the problem by rounding the length of my text (applied to the original DataFrame, before any grouping):
df['rounded_length'] = (df['end'] - df['start']).round(-1)
df = df.groupby('rounded_length')
All my values become multiples of 10, and I can group them this way.
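A quick check of the buckets on the sample data (a minimal sketch; note that two lengths within 5 of each other can still straddle a bucket edge, e.g. 254 rounds to 250 while 256 rounds to 260):
for length, group in df:  # df is now a GroupBy object
    print(length, group['text'].tolist())
250 ['aaa', 'ddd', 'eee']
500 ['bbb', 'ccc', 'fff']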
I have data like below:
df = pd.DataFrame()
df["collection_amount"] = 100, 200, 300
df["25%_coll"] = 1, 0, 1
df["75%_coll"] = 0, 1, 1
df["month"] = 4, 5, 6
I want to create an output like below: basically, if 25%_coll is 1 for a row, a new column should be created, named using that row's month. Please help me, thank you.
This should work; do ask if something doesn't make sense:
for i in range(len(df)):
    # for each flagged row, build a new column holding that row's
    # collection_amount at its own position and 0 everywhere else
    if df['25%_coll'][i] == 1:
        df['month_%i_25%%_coll' % df.month[i]] = [df.collection_amount[i] if k == i else 0 for k in range(len(df))]
    if df['75%_coll'][i] == 1:
        df['month_%i_75%%_coll' % df.month[i]] = [df.collection_amount[i] if k == i else 0 for k in range(len(df))]
To build the new columns you could try the following:
df2 = df.melt(id_vars=["month", "collection_amount"])
df2.loc[df2["value"].eq(0), "collection_amount"] = 0
df2["new_cols"] = "month_" + df2["month"].astype("str") + "_" + df2["variable"]
df2 = df2.pivot_table(
    index="month", columns="new_cols", values="collection_amount",
    fill_value=0, aggfunc="sum"
).reset_index(drop=True)
1. .melt() the dataframe with id_vars columns month and collection_amount.
2. Set collection_amount to 0 wherever value is 0.
3. Build the new column names in column new_cols:
   month  collection_amount  variable  value          new_cols
0      4                100  25%_coll      1  month_4_25%_coll
1      5                  0  25%_coll      0  month_5_25%_coll
2      6                300  25%_coll      1  month_6_25%_coll
3      4                  0  75%_coll      0  month_4_75%_coll
4      5                200  75%_coll      1  month_5_75%_coll
5      6                300  75%_coll      1  month_6_75%_coll
4. Use .pivot_table() on this dataframe to build the new columns.
The rest of the expected output isn't completely clear: either use df = pd.concat([df, df2], axis=1), or df.merge(df2, ...) to merge on month (with .reset_index() without drop=True); see the sketch after the result below.
Result for the sample dataframe
df = pd.DataFrame({
    "collection_amount": [100, 200, 300],
    "25%_coll": [1, 0, 1], "75%_coll": [0, 1, 1],
    "month": [4, 5, 6]
})
is
new_cols  month_4_25%_coll  month_4_75%_coll  month_5_25%_coll  \
0                      100                 0                 0
1                        0                 0                 0
2                        0                 0                 0

new_cols  month_5_75%_coll  month_6_25%_coll  month_6_75%_coll
0                        0                 0                 0
1                      200                 0                 0
2                        0               300               300
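A minimal sketch of the combination step mentioned above, assuming the original df and the pivoted df2 end up in the same row order:
out = pd.concat([df, df2], axis=1)  # original columns plus the new month_* columns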
I have a list as follows.
mylist = [["cat", "dog"], ["dog", "rat"], ["parrot", "cat"], ["mouse", "rat"], ["mouse", "cat"]]
I want to get a summary dataframe for each animal as follows.
        cat  dog  rat  parrot  mouse
cat       0    1    0       1      1
dog       1    0    1       0      0
rat       0    1    0       0      1
parrot    1    0    0       0      0
mouse     1    0    1       0      0
I am wondering if there is a standard way of doing this in pandas.
My current code is as follows.
import pandas as pd
mylist = [["cat", "dog"], ["dog", "rat"], ["parrot", "cat"], ["mouse", "rat"], ["mouse", "cat"]]
df = pd.DataFrame(mylist)
I am happy to provide more details if needed.
We can do it with stack, str.get_dummies and dot:
import numpy as np

# df = pd.DataFrame(mylist), as in the question
s = df.stack().str.get_dummies().sum(level=0)  # use .groupby(level=0).sum() on pandas >= 2.0
s = s.T.dot(s)  # co-occurrence counts
s.values[tuple([np.arange(s.shape[0])] * 2)] = 0  # zero out the diagonal
s
        cat  dog  mouse  parrot  rat
cat       0    1      1       1    0
dog       1    0      0       0    1
mouse     1    0      0       0    1
parrot    1    0      0       0    0
rat       0    1      1       0    0
Let's try a different approach using pd.crosstab:
idx = ['cat', 'dog', 'rat', 'parrot', 'mouse']
df1 = pd.crosstab(df[0], df[1]).reindex(index=idx, columns=idx, fill_value=0)
result = df1 + df1.T  # symmetrize, since the pairs are unordered
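For the sample list this reproduces the matrix from the question (row and column order follows idx):
print(result)
        cat  dog  rat  parrot  mouse
cat       0    1    0       1      1
dog       1    0    1       0      0
rat       0    1    0       0      1
parrot    1    0    0       0      0
mouse     1    0    1       0      0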
I have a dataframe as below.
D1 = pd.DataFrame({'a': [15, 22, 107, 120],
                   'b': [25, 21, 95, 110]})
I am trying to randomly insert two zeros into column 'b' to get the effect below; each inserted 0 shifts the remaining values down one row.
D1 = pd.DataFrame({'a': [15, 22, 107, 120, 0, 0],
                   'b': [0, 25, 21, 0, 95, 110]})
Everything I have seen is about inserting into the whole column as opposed to individual rows.
Here is one potential way to achieve this using numpy.random.randint and numpy.insert (the positions are random, so the output below is one possibility):
import numpy as np

n = 2
rand_idx = np.random.randint(0, len(D1), size=n)

# Append 'n' rows of zeroes to D1 (DataFrame.append was removed in pandas 2.0,
# so pd.concat is used here instead)
zeros = pd.DataFrame(np.zeros((n, D1.shape[1]), dtype=int), columns=D1.columns)
D2 = pd.concat([D1, zeros], ignore_index=True)

# Insert n zeroes at the random indices and assign back to column 'b'
D2['b'] = np.insert(D1['b'].values, rand_idx, 0)
print(D2)
     a    b
0   15   25
1   22    0
2  107    0
3  120   21
4    0   95
5    0  110
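If reproducibility matters, the random positions can be pinned by seeding NumPy's global generator before drawing them, e.g.:
np.random.seed(0)  # makes rand_idx, and hence the inserted positions, deterministic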
Use numpy.insert with set positions - for a at the end (offset by the length of the original DataFrame) and for b at random positions. One possible output is shown, since the positions in b are random:
n = 2
new = np.zeros(n, dtype=int)
# pad column 'a' with n trailing zeros
a = np.insert(D1['a'].values, len(D1), new)
# insert n zeros at random positions in column 'b'
b = np.insert(D1['b'].values, np.random.randint(0, len(D1), size=n), new)
# pandas 0.24+
# a = np.insert(D1['a'].to_numpy(), len(D1), new)
# b = np.insert(D1['b'].to_numpy(), np.random.randint(0, len(D1), size=n), new)
df = pd.DataFrame({'a': a, 'b': b})
print(df)
     a    b
0   15    0
1   22   25
2  107   21
3  120    0
4    0   95
5    0  110
date_0 = list(pd.date_range('2017-01-01', periods=6, freq='MS'))
date_1 = list(pd.date_range('2017-01-01', periods=8, freq='MS'))
data_0 = [9, 8, 4, 0, 0, 0]
data_1 = [9, 9, 0, 0, 0, 7, 0, 0]
id_0 = [0]*6
id_1 = [1]*8
df = pd.DataFrame({'ids': id_0 + id_1, 'dates': date_0 + date_1, 'data': data_0 + data_1})
For each id (here 0 and 1) I want to know how long is the series of zeros at the end of the time frame.
For the given example, the result is id_0 = 3, id_1 = 2.
So how do I limit the timestamps so that I can run something like this:
df.groupby('ids').agg('count')
First get the runs of consecutive 0s with a trick: compare the data with its shifted values for inequality, take the cumulative sum, and mask out nonzero values.
Then count sizes per group, remove the first level of the MultiIndex, and keep the last value per id via drop_duplicates with keep='last':
s = df['data'].ne(df['data'].shift()).cumsum().mul(~df['data'].astype(bool))
df = (s.groupby([df['ids'], s]).size()
       .reset_index(level=1, drop=True)
       .reset_index(name='val')
       .drop_duplicates('ids', keep='last'))
print(df)
   ids  val
1    0    3
4    1    2
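A minimal alternative sketch (applied to the original df from the question, before it was reassigned above) that counts the trailing zeros directly by reversing each group, and also returns 0 when a series does not end with zeros:
trailing = df.groupby('ids')['data'].apply(
    lambda s: int((s[::-1].ne(0).cumsum() == 0).sum())
)
print(trailing)
ids
0    3
1    2
Name: data, dtype: int64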
When there is missing data in a pandas DataFrame, indexing does not work as I would expect it to.
import pandas as pd
from datetime import datetime
df = pd.DataFrame({'a': [datetime(2011, 1, 1), datetime(2013, 1, 1)],
                   'b': [datetime(2010, 1, 1), datetime(2014, 1, 1)]})
df > datetime(2012, 1, 1)
works as expected:
       a      b
0  False  False
1   True   True
but if there is a missing value
none_df = pd.DataFrame({'a': [datetime(2011, 1, 1), datetime(2013, 1, 1)],
                        'b': [datetime(2010, 1, 1), None]})
none_df > datetime(2012, 1, 1)
the selection returns all True
      a     b
0  True  True
1  True  True
Am I doing something wrong? Is this desired behavior?
Python 3.5 64bit, Pandas 0.18.0, Windows 10
I agree that the behavior is unusual.
This is a work-around solution:
>>> none_df.apply(lambda col: col > datetime(2012, 1, 1))
       a      b
0  False  False
1   True  False
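For reference, on recent pandas versions the direct comparison should give the same result, since None is stored as NaT and NaT comparisons evaluate to False (worth verifying on your own version):
>>> none_df > datetime(2012, 1, 1)
       a      b
0  False  False
1   True  False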