I have a dataframe with two integer columns that represent the start and end of a string of text. I'd like to group my rows by the length of the text (end - start), but with a margin of error of ±5 characters, so that these rows would fall into two groups (length ≈ 250 and length ≈ 500):
start  end
0      251
1      250
2      250

0      500
1      500
0      499
How would I achieve something like this?
Here is the code I am using right now:
import pandas as pd

d = {'text': ["aaa", "bbb", "ccc", "ddd", "eee", "fff"],
     'start': [0, 1, 0, 2, 1, 0],
     'end': [250, 500, 501, 251, 249, 499]}
df = pd.DataFrame(data=d)
df = df.groupby(['start', 'end'])  # only groups exact (start, end) pairs
I ended up solving the problem by rounding the length of my text (applied to the original DataFrame, before any grouping):
df['rounded_length'] = (df['end'] - df['start']).round(-1)
df = df.groupby('rounded_length')
All my values become multiples of 10, and I can group them this way.
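A quick check of the buckets on the sample data (a minimal sketch; note that two lengths within 5 of each other can still straddle a bucket edge, e.g. 254 rounds to 250 while 256 rounds to 260):
for length, group in df:  # df is now a GroupBy object
    print(length, group['text'].tolist())
250 ['aaa', 'ddd', 'eee']
500 ['bbb', 'ccc', 'fff']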
I have data like below:
df = pd.DataFrame()
df["collection_amount"] = 100, 200, 300
df["25%_coll"] = 1, 0, 1
df["75%_coll"] = 0, 1, 1
df["month"] = 4, 5, 6
I want to create an output like below: basically, if 25%_coll is 1 for a row, a new column should be created, named using that row's month. Please help me, thank you.
This should work; do ask if something doesn't make sense:
for i in range(len(df)):
    # for each flagged row, build a new column holding that row's
    # collection_amount at its own position and 0 everywhere else
    if df['25%_coll'][i] == 1:
        df['month_%i_25%%_coll' % df.month[i]] = [df.collection_amount[i] if k == i else 0 for k in range(len(df))]
    if df['75%_coll'][i] == 1:
        df['month_%i_75%%_coll' % df.month[i]] = [df.collection_amount[i] if k == i else 0 for k in range(len(df))]
To build the new columns you could try the following:
df2 = df.melt(id_vars=["month", "collection_amount"])
df2.loc[df2["value"].eq(0), "collection_amount"] = 0
df2["new_cols"] = "month_" + df2["month"].astype("str") + "_" + df2["variable"]
df2 = df2.pivot_table(
    index="month", columns="new_cols", values="collection_amount",
    fill_value=0, aggfunc="sum"
).reset_index(drop=True)
1. .melt() the dataframe with id_vars columns month and collection_amount.
2. Set collection_amount to 0 wherever value is 0.
3. Build the new column names in column new_cols:
   month  collection_amount  variable  value          new_cols
0      4                100  25%_coll      1  month_4_25%_coll
1      5                  0  25%_coll      0  month_5_25%_coll
2      6                300  25%_coll      1  month_6_25%_coll
3      4                  0  75%_coll      0  month_4_75%_coll
4      5                200  75%_coll      1  month_5_75%_coll
5      6                300  75%_coll      1  month_6_75%_coll
4. Use .pivot_table() on this dataframe to build the new columns.
The rest of the expected output isn't completely clear: either use df = pd.concat([df, df2], axis=1), or df.merge(df2, ...) to merge on month (with .reset_index() without drop=True); see the sketch after the result below.
Result for the sample dataframe
df = pd.DataFrame({
    "collection_amount": [100, 200, 300],
    "25%_coll": [1, 0, 1], "75%_coll": [0, 1, 1],
    "month": [4, 5, 6]
})
is
new_cols  month_4_25%_coll  month_4_75%_coll  month_5_25%_coll  \
0                      100                 0                 0
1                        0                 0                 0
2                        0                 0                 0

new_cols  month_5_75%_coll  month_6_25%_coll  month_6_75%_coll
0                        0                 0                 0
1                      200                 0                 0
2                        0               300               300
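A minimal sketch of the combination step mentioned above, assuming the original df and the pivoted df2 end up in the same row order:
out = pd.concat([df, df2], axis=1)  # original columns plus the new month_* columns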
I have a list as follows.
mylist = [["cat", "dog"], ["dog", "rat"], ["parrot", "cat"], ["mouse", "rat"], ["mouse", "cat"]]
I want to get a summary dataframe for each animal as follows.
        cat  dog  rat  parrot  mouse
cat       0    1    0       1      1
dog       1    0    1       0      0
rat       0    1    0       0      1
parrot    1    0    0       0      0
mouse     1    0    1       0      0
I am wondering if there is a standard way of doing this in pandas.
My current code is as follows.
import pandas as pd
mylist = [["cat", "dog"], ["dog", "rat"], ["parrot", "cat"], ["mouse", "rat"], ["mouse", "cat"]]
df = pd.DataFrame(mylist)
I am happy to provide more details if needed.
We can do it with stack, str.get_dummies and dot:
import numpy as np

# df = pd.DataFrame(mylist), as in the question
s = df.stack().str.get_dummies().sum(level=0)  # use .groupby(level=0).sum() on pandas >= 2.0
s = s.T.dot(s)  # co-occurrence counts
s.values[tuple([np.arange(s.shape[0])] * 2)] = 0  # zero out the diagonal
s
        cat  dog  mouse  parrot  rat
cat       0    1      1       1    0
dog       1    0      0       0    1
mouse     1    0      0       0    1
parrot    1    0      0       0    0
rat       0    1      1       0    0
Let's try a different approach using pd.crosstab:
idx = ['cat', 'dog', 'rat', 'parrot', 'mouse']
df1 = pd.crosstab(df[0], df[1]).reindex(index=idx, columns=idx, fill_value=0)
result = df1 + df1.T  # symmetrize, since the pairs are unordered
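For the sample list this reproduces the matrix from the question (row and column order follows idx):
print(result)
        cat  dog  rat  parrot  mouse
cat       0    1    0       1      1
dog       1    0    1       0      0
rat       0    1    0       0      1
parrot    1    0    0       0      0
mouse     1    0    1       0      0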
I have a dataframe as below.
D1 = pd.DataFrame({'a': [15, 22, 107, 120],
                   'b': [25, 21, 95, 110]})
I am trying to randomly insert two zeros into column 'b' to get the effect below; each inserted 0 shifts the remaining values down one row.
D1 = pd.DataFrame({'a': [15, 22, 107, 120, 0, 0],
                   'b': [0, 25, 21, 0, 95, 110]})
Everything I have seen is about inserting into the whole column as opposed to individual rows.
Here is one potential way to achieve this using numpy.random.randint and numpy.insert (the positions are random, so the output below is one possibility):
import numpy as np

n = 2
rand_idx = np.random.randint(0, len(D1), size=n)

# Append 'n' rows of zeroes to D1 (DataFrame.append was removed in pandas 2.0,
# so pd.concat is used here instead)
zeros = pd.DataFrame(np.zeros((n, D1.shape[1]), dtype=int), columns=D1.columns)
D2 = pd.concat([D1, zeros], ignore_index=True)

# Insert n zeroes at the random indices and assign back to column 'b'
D2['b'] = np.insert(D1['b'].values, rand_idx, 0)
print(D2)
     a    b
0   15   25
1   22    0
2  107    0
3  120   21
4    0   95
5    0  110
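If reproducibility matters, the random positions can be pinned by seeding NumPy's global generator before drawing them, e.g.:
np.random.seed(0)  # makes rand_idx, and hence the inserted positions, deterministic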
Use numpy.insert with set positions - for a at the end (offset by the length of the original DataFrame) and for b at random positions. One possible output is shown, since the positions in b are random:
n = 2
new = np.zeros(n, dtype=int)
# pad column 'a' with n trailing zeros
a = np.insert(D1['a'].values, len(D1), new)
# insert n zeros at random positions in column 'b'
b = np.insert(D1['b'].values, np.random.randint(0, len(D1), size=n), new)
# pandas 0.24+
# a = np.insert(D1['a'].to_numpy(), len(D1), new)
# b = np.insert(D1['b'].to_numpy(), np.random.randint(0, len(D1), size=n), new)
df = pd.DataFrame({'a': a, 'b': b})
print(df)
     a    b
0   15    0
1   22   25
2  107   21
3  120    0
4    0   95
5    0  110
date_0 = list(pd.date_range('2017-01-01', periods=6, freq='MS'))
date_1 = list(pd.date_range('2017-01-01', periods=8, freq='MS'))
data_0 = [9, 8, 4, 0, 0, 0]
data_1 = [9, 9, 0, 0, 0, 7, 0, 0]
id_0 = [0]*6
id_1 = [1]*8
df = pd.DataFrame({'ids': id_0 + id_1, 'dates': date_0 + date_1, 'data': data_0 + data_1})
For each id (here 0 and 1) I want to know how long is the series of zeros at the end of the time frame.
For the given example, the result is id_0 = 3, id_1 = 2.
So how do I limit the timestamps so that I can run something like this:
df.groupby('ids').agg('count')
First get the runs of consecutive 0s with a trick: compare the data with its shifted values for inequality, take the cumulative sum, and mask out nonzero values.
Then count sizes per group, remove the first level of the MultiIndex, and keep the last value per id via drop_duplicates with keep='last':
s = df['data'].ne(df['data'].shift()).cumsum().mul(~df['data'].astype(bool))
df = (s.groupby([df['ids'], s]).size()
       .reset_index(level=1, drop=True)
       .reset_index(name='val')
       .drop_duplicates('ids', keep='last'))
print(df)
   ids  val
1    0    3
4    1    2
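A minimal alternative sketch (applied to the original df from the question, before it was reassigned above) that counts the trailing zeros directly by reversing each group, and also returns 0 when a series does not end with zeros:
trailing = df.groupby('ids')['data'].apply(
    lambda s: int((s[::-1].ne(0).cumsum() == 0).sum())
)
print(trailing)
ids
0    3
1    2
Name: data, dtype: int64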
When there is missing data in a pandas DataFrame, indexing does not work as I would expect it to.
import pandas as pd
from datetime import datetime
df = pd.DataFrame({'a': [datetime(2011, 1, 1), datetime(2013, 1, 1)],
                   'b': [datetime(2010, 1, 1), datetime(2014, 1, 1)]})
df > datetime(2012, 1, 1)
works as expected:
       a      b
0  False  False
1   True   True
but if there is a missing value
none_df = pd.DataFrame({'a': [datetime(2011, 1, 1), datetime(2013, 1, 1)],
                        'b': [datetime(2010, 1, 1), None]})
none_df > datetime(2012, 1, 1)
the selection returns all True
      a     b
0  True  True
1  True  True
Am I doing something wrong? Is this desired behavior?
Python 3.5 64bit, Pandas 0.18.0, Windows 10
I agree that the behavior is unusual.
This is a work-around solution:
>>> none_df.apply(lambda col: col > datetime(2012, 1, 1))
       a      b
0  False  False
1   True  False
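For reference, on recent pandas versions the direct comparison should give the same result, since None is stored as NaT and NaT comparisons evaluate to False (worth verifying on your own version):
>>> none_df > datetime(2012, 1, 1)
       a      b
0  False  False
1   True  False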