I am new to Pandas. I have a CSV file with a single column of values, and I want to move every second and third row into the value1 and value2 columns. Could someone please help me out? I can't seem to figure it out.
data, value1, value2
1.00
2.00
3.00
4.00
5.00
6.00
7.00
8.00
9.00
The output should turn into this:
data, value1, value2
1.00 2.00 3.00
4.00 5.00 6.00
7.00 8.00 9.00
A more general solution is to create a MultiIndex.from_arrays from the modulo and floor division of numpy.arange, then unstack:
print(df)
data
0 1.0
1 2.0
2 3.0
3 4.0
4 5.0
5 6.0
6 7.0
7 8.0
8 9.0
9 10.0
import numpy as np
import pandas as pd

a = np.arange(len(df.index))
print(a)
[0 1 2 3 4 5 6 7 8 9]
df.index = pd.MultiIndex.from_arrays([a % 3, a // 3])
print(df)
data
0 0 1.0
1 0 2.0
2 0 3.0
0 1 4.0
1 1 5.0
2 1 6.0
0 2 7.0
1 2 8.0
2 2 9.0
0 3 10.0
df1 = df['data'].unstack(0)
df1.columns = ['data', 'value1', 'value2']
print(df1)
data value1 value2
0 1.0 2.0 3.0
1 4.0 5.0 6.0
2 7.0 8.0 9.0
3 10.0 NaN NaN
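If the trailing partial group is unwanted, a minimal follow-up (assuming the leftover NaN cells should simply be discarded) is:

df1 = df1.dropna()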
You can use the numpy reshape method, then convert back to a DataFrame with pd.DataFrame and name your columns:
pd.DataFrame(df.values.reshape(3, 3), columns=['data', 'value1', 'value2'])
Output:
data value1 value2
0 1 2 3
1 4 5 6
2 7 8 9
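If the row count is not known ahead of time, the same idea can be written with an inferred dimension; a small sketch, assuming the length of df is a multiple of 3:

import pandas as pd

# -1 lets numpy infer the number of rows from the data length
out = pd.DataFrame(df['data'].to_numpy().reshape(-1, 3),
                   columns=['data', 'value1', 'value2'])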
I have the data below. The new pandas version doesn't preserve the grouping columns after an operation such as fillna/ffill/bfill. Is there a way to keep the grouping columns?
data = """one;two;three
1;1;10
1;1;nan
1;1;nan
1;2;nan
1;2;20
1;2;nan
1;3;nan
1;3;nan"""
import io
import pandas as pd

df = pd.read_csv(io.StringIO(data), sep=";")
print(df)
one two three
0 1 1 10.0
1 1 1 NaN
2 1 1 NaN
3 1 2 NaN
4 1 2 20.0
5 1 2 NaN
6 1 3 NaN
7 1 3 NaN
print(df.groupby(['one','two']).ffill())
three
0 10.0
1 10.0
2 10.0
3 NaN
4 20.0
5 20.0
6 NaN
7 NaN
With the most recent pandas, if we would like to keep the groupby columns, we need to add apply here:
out = df.groupby(['one','two']).apply(lambda x: x.ffill())
Out[219]:
one two three
0 1 1 10.0
1 1 1 10.0
2 1 1 10.0
3 1 2 NaN
4 1 2 20.0
5 1 2 20.0
6 1 3 NaN
7 1 3 NaN
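Depending on the pandas version, apply may also prepend the group keys to the index; in recent versions, passing group_keys=False keeps the flat 0..7 index shown above:

out = df.groupby(['one', 'two'], group_keys=False).apply(lambda x: x.ffill())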
Is this what you expect?
df['three'] = df.groupby(['one','two'])['three'].ffill()
print(df)
# Output:
one two three
0 1 1 10.0
1 1 1 10.0
2 1 1 10.0
3 1 2 NaN
4 1 2 20.0
5 1 2 20.0
6 1 3 NaN
7 1 3 NaN
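If several value columns need filling, the same assign-back pattern generalizes; a minimal sketch, assuming every non-key column should be forward filled within its group:

cols = [c for c in df.columns if c not in ('one', 'two')]
df[cols] = df.groupby(['one', 'two'])[cols].ffill()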
Yes, set the index first and then group, so that the grouping columns are preserved, as shown here (add .reset_index() at the end if you want one and two back as regular columns):
df = pd.read_csv(io.StringIO(data), sep=";")
df.set_index(['one','two'], inplace=True)
df.groupby(['one','two']).ffill()
I have a column. How can I make a new column that counts consecutive runs of positive and negative signs?
col1
-5
-3
-7
4
5
-0.5
6
8
9
col1 count_sign
-5 3
-3 3
-7 3
4 2
5 2
-0.5 1
6 3
8 3
9 3
The first 3 rows get 3 because there are 3 consecutive negative values, then 2 consecutive positives give 2, a single negative gives 1, and so on.
# identify the change of signs between rows:
# count is NaN where the sign matches the previous row, else 1
import numpy as np

df['count'] = np.where(np.sign(df['col1']).diff().eq(0), np.nan, 1)
# cumsum skips the NaNs, numbering each run; ffill spreads the run number
df['count'] = df['count'].cumsum().ffill()
# groupby each run and broadcast its size back with transform
df['count'] = df.groupby('count')['col1'].transform('size')
df
col1 count
0 -5.0 3
1 -3.0 3
2 -7.0 3
3 4.0 2
4 5.0 2
5 -0.5 1
6 6.0 3
7 8.0 3
8 9.0 3
To add a sign to the count values:
df['count'] = np.where(np.sign(df['col1']).diff().eq(0), np.nan, 1)
df['count'] = df['count'].cumsum().ffill()
df['count'] = df.groupby('count')['col1'].transform('size') * np.sign(df['col1'])
df
col1 count
0 -5.0 -3.0
1 -3.0 -3.0
2 -7.0 -3.0
3 4.0 2.0
4 5.0 2.0
5 -0.5 -1.0
6 6.0 3.0
7 8.0 3.0
8 9.0 3.0
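The same grouping trick can be written more compactly by comparing each sign with its neighbour and cumulatively summing the changes; a sketch of the equivalent idiom:

import numpy as np

s = np.sign(df['col1'])
# a new group starts whenever the sign differs from the previous row
group = s.ne(s.shift()).cumsum()
df['count_sign'] = df.groupby(group)['col1'].transform('size') * s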
I have this dataframe:
value limit_1 limit_2 limit_3 limit_4
10 2 3 7 10
11 5 6 11 13
2 0.3 0.9 2.01 2.99
I want to add another column called CLASS that classifies the value column this way:
if value <= limit_1 then 1
if value > limit_1 and value <= limit_2 then 2
if value > limit_2 and value <= limit_3 then 3
if value > limit_3 then 4
to get this result:
value limit_1 limit_2 limit_3 limit_4 CLASS
10 2 3 7 10 4
11 5 6 11 13 3
2 0.3 0.9 2.01 2.99 3
I know I could get these ifs to work, but my dataframe has 2 million rows and I need the fastest way to perform this classification.
I tried the .cut function, but the result was not what I expected.
Thanks
We can use the rank method over the column axis (axis=1): since the limit columns are sorted ascending, the rank of value among [value, limit_1, ..., limit_4] is exactly the class (method="first" breaks ties in favour of value, which appears first, so value == limit_k still yields class k):
df["CLASS"] = df.rank(axis=1, method="first").iloc[:, 0].astype(int)
value limit_1 limit_2 limit_3 limit_4 CLASS
0 10 2.0 3.0 7.00 10.00 4
1 11 5.0 6.0 11.00 13.00 3
2 2 0.3 0.9 2.01 2.99 3
We can use np.select:
import numpy as np

# inclusive="right" implements "> lower and <= upper" (pandas >= 1.3)
conditions = [df["value"] <= df["limit_1"],
              df["value"].between(df["limit_1"], df["limit_2"], inclusive="right"),
              df["value"].between(df["limit_2"], df["limit_3"], inclusive="right"),
              df["value"] > df["limit_3"]]
df["CLASS"] = np.select(conditions, [1, 2, 3, 4])
>>> df
value limit_1 limit_2 limit_3 limit_4 CLASS
0 10 2.0 3.0 7.00 10.00 4
1 11 5.0 6.0 11.00 13.00 3
2 2 0.3 0.9 2.01 2.99 3
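Since the class is just one plus the number of limits strictly below value, another vectorized option for a frame this large is a row-wise comparison and sum; a sketch, assuming the limit columns are sorted ascending:

# count how many of limit_1..limit_3 each value strictly exceeds
limits = df[['limit_1', 'limit_2', 'limit_3']]
df['CLASS'] = limits.lt(df['value'], axis=0).sum(axis=1) + 1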
I have a dataframe:
# create example df
import numpy as np
import pandas as pd

df = pd.DataFrame(index=[1,2,3,4,5,6,7])
df['ID'] = [1,1,1,1,2,2,2]
df['election_date'] = pd.date_range("01/01/2010", periods=7, freq="M")
df['stock_price'] = [1,np.nan,np.nan,4,5,np.nan,7]
# sort values
df.sort_values(['election_date'], inplace=True, ascending=False)
df.reset_index(drop=True, inplace=True)
df
ID election_date stock_price
0 2 2010-07-31 7.0
1 2 2010-06-30 NaN
2 2 2010-05-31 5.0
3 1 2010-04-30 4.0
4 1 2010-03-31 NaN
5 1 2010-02-28 NaN
6 1 2010-01-31 1.0
I would like to calculate, for every ID, the cumulative count of np.nan in the stock_price column (counting from the earliest date and excluding the current row).
The expected result is:
df
ID election_date stock_price cum_count_nans
0 2 2010-07-31 7.0 1
1 2 2010-06-30 NaN 0
2 2 2010-05-31 5.0 0
3 1 2010-04-30 4.0 2
4 1 2010-03-31 NaN 1
5 1 2010-02-28 NaN 0
6 1 2010-01-31 1.0 0
Any ideas how to solve it?
The idea is to reverse the row order with indexing, then in a custom function test for missing values, shift (so the current row is excluded), and take the cumulative sum:
f = lambda x: x.isna().shift(fill_value=0).cumsum()
df['cum_count_nans'] = df.iloc[::-1].groupby('ID')['stock_price'].transform(f)
print(df)
ID election_date stock_price cum_count_nans
0 2 2010-07-31 7.0 1
1 2 2010-06-30 NaN 0
2 2 2010-05-31 5.0 0
3 1 2010-04-30 4.0 2
4 1 2010-03-31 NaN 1
5 1 2010-02-28 NaN 0
6 1 2010-01-31 1.0 0
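The shift inside the lambda can also be folded away by subtracting the current row's flag from an inclusive cumulative sum; a sketch of the same idea:

rev = df.iloc[::-1]                      # earliest date first within each ID
isna = rev['stock_price'].isna().astype(int)
# inclusive count of NaNs so far minus the current row; assignment aligns by index
df['cum_count_nans'] = isna.groupby(rev['ID']).cumsum() - isna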
For this dataframe:
values ii
0 3.0 4
1 0.0 1
2 3.0 8
3 2.0 5
4 2.0 1
5 3.0 5
6 2.0 4
7 1.0 8
8 0.0 5
9 1.0 1
This line raises "Must produce aggregated value":
bii2 = df.groupby(['ii'])['values'].agg(pd.Series.mode)
While this line works:
bii3 = df.groupby('ii')['values'].agg(lambda x: pd.Series.mode(x)[0])
Could you explain why that is?
The problem is that mode sometimes returns 2 or more values; check the result with GroupBy.apply:
bii2 = df.groupby(['ii'])['values'].apply(pd.Series.mode)
print(bii2)
ii
1 0 0.0
1 1.0
2 2.0
4 0 2.0
1 3.0
5 0 0.0
1 2.0
2 3.0
8 0 1.0
1 3.0
Name: values, dtype: float64
And agg needs a scalar output, so it raises an error. If you select the first value, it works nicely:
bii3 = df.groupby('ii')['values'].agg(lambda x: pd.Series.mode(x).iat[0])
print(bii3)
ii
1 0.0
4 2.0
5 0.0
8 1.0
Name: values, dtype: float64
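If every mode (not just the first) is wanted per group, returning a plain list keeps agg happy, because a list is treated as a single scalar-like object; a sketch:

bii4 = df.groupby('ii')['values'].agg(lambda x: x.mode().tolist())
print(bii4)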