I made a value-counts DataFrame from another DataFrame, for example:

         freq
0           2
0.33333    10
1.66667    13

Its index is automatically 0, 0.33333, 1.66667, and the index values can vary, because I intend to make many DataFrames based on a specific value. How can I insert an integer index, like this?

              freq
0  0             2
1  0.33333      10
2  1.66667      13

Thanks!
The result you get back from value_counts is a Series, and to get a generic 0 ... n-1 integer index you can use reset_index:
In [4]: s = pd.Series([0, 0.3, 0.3, 1.6])

In [5]: s.value_counts()
Out[5]:
0.3    2
1.6    1
0.0    1
dtype: int64

In [9]: s.value_counts().reset_index(name='freq')
Out[9]:
   index  freq
0    0.3     2
1    1.6     1
2    0.0     1
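If you also want a friendlier name for the first column (reset_index calls it 'index' when the Series has no name), you can chain a rename; the name 'value' here is just an example:

In [10]: s.value_counts().reset_index(name='freq').rename(columns={'index': 'value'})
Out[10]:
   value  freq
0    0.3     2
1    1.6     1
2    0.0     1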
I have the following pandas DataFrame:

   0
0
A  0
B  0
C  0
C  4
A  1
A  7

Some index letters (A and C) appear multiple times. I want the values for these repeated index letters in extra columns instead of extra rows. The desired DataFrame looks like:

   0       1       2
0
A  0       1       7
B  0  np.nan  np.nan
C  0       4  np.nan

Anything would help!
IIUC, you need to add a helper column:
(df.assign(group=df.groupby(level=0).cumcount())
.set_index('group', append=True)[0] # 0 is the name of the column here
.unstack('group')
)
or:
(df.reset_index()
   .assign(group=lambda d: d.groupby('index').cumcount())
   .pivot(index='index', columns='group', values=0)  # 0 is again the column name
)
output:

group    0    1    2
A      0.0  1.0  7.0
B      0.0  NaN  NaN
C      0.0  4.0  NaN
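For intuition, here is what the cumcount helper looks like on a minimal reconstruction of the question's frame (the construction below is my own sketch of the data):

import pandas as pd

df = pd.DataFrame({0: [0, 0, 0, 4, 1, 7]}, index=['A', 'B', 'C', 'C', 'A', 'A'])

# cumcount numbers repeated index labels 0, 1, 2, ... within each group;
# these numbers become the column labels after unstack/pivot
df.groupby(level=0).cumcount()
# A    0
# B    0
# C    0
# C    1
# A    1
# A    2
# dtype: int64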
What is the difference between the 'set' operation using loc vs iloc?
df.iloc[2, df.columns.get_loc('ColName')] = 3
# vs
df.loc[2, 'ColName'] = 3
Why does the documentation page for iloc (https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.iloc.html) not have any 'set' examples like those shown on the loc page (https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.loc.html)? Is loc the preferred way?
There isn't much of a difference; it comes down to what you have at hand.
If you have the index label and the column name (which is most of the time), you use the loc (location) operator to assign values.
If, as in an ordinary matrix, you only have the integer positions of the row and column, you use iloc (integer-based location) for assignment.
Pandas DataFrames support both integer-based and label-based indexing. The ambiguity arises when the index itself consists of integers rather than strings, so the two separate operators exist to make it explicit whether you want integer-based or label-based indexing.
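A minimal sketch of the distinction, assuming a small frame with a non-default integer index:

import pandas as pd

df = pd.DataFrame({'ColName': [10, 20, 30]}, index=[2, 1, 8])

df.iloc[2, df.columns.get_loc('ColName')] = 3   # position 2 -> the row labeled 8
df.loc[2, 'ColName'] = 3                        # label 2 -> the first row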
The main difference is that iloc sets values by position, loc by label.

Here are some examples.

Sample with a non-default integer index (if label 2 exists its cell is overwritten, otherwise a new row with that label is appended):
np.random.seed(123)
df = pd.DataFrame(np.random.randint(10, size=(3,3)), columns=['A','B','C'], index=[2,1,8])
print (df)
   A  B  C
2  2  2  6
1  1  3  9
8  6  1  0
df.iloc[2, df.columns.get_loc('A')] = 30   # position 2 is the row labeled 8
print (df)
    A  B  C
2   2  2  6
1   1  3  9
8  30  1  0
Overwritten label 2:

df.loc[2, 'A'] = 50
print (df)
    A  B  C
2  50  2  6
1   1  3  9
8  30  1  0

Appended new row with label 0:

df.loc[0, 'A'] = 70
print (df)
      A    B    C
2  50.0  2.0  6.0
1   1.0  3.0  9.0
8  30.0  1.0  0.0
0  70.0  NaN  NaN
Default index (both work the same here, because the row at position 2 also has label 2):

np.random.seed(123)
df = pd.DataFrame(np.random.randint(10, size=(3,3)), columns=['A','B','C'])
print (df)
   A  B  C
0  2  2  6
1  1  3  9
2  6  1  0

df.iloc[2, df.columns.get_loc('A')] = 30
print (df)
    A  B  C
0   2  2  6
1   1  3  9
2  30  1  0

df.loc[2, 'A'] = 50
print (df)
    A  B  C
0   2  2  6
1   1  3  9
2  50  1  0
Non-integer index (setting by position works; label 2 does not exist, so selecting by label appends a new row):

np.random.seed(123)
df = pd.DataFrame(np.random.randint(10, size=(3,3)), columns=['A','B','C'], index=list('abc'))
print (df)
   A  B  C
a  2  2  6
b  1  3  9
c  6  1  0

df.iloc[2, df.columns.get_loc('A')] = 30
print (df)
    A  B  C
a   2  2  6
b   1  3  9
c  30  1  0

df.loc[2, 'A'] = 50
print (df)
      A    B    C
a   2.0  2.0  6.0
b   1.0  3.0  9.0
c  30.0  1.0  0.0
2  50.0  NaN  NaN
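As a side note, for setting a single existing cell there are also the scalar accessors at (label-based) and iat (position-based), which mirror loc/iloc but are faster for scalar access:

df.at[2, 'A'] = 50                        # by label, like loc
df.iat[2, df.columns.get_loc('A')] = 30   # by position, like iloc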
I'm trying to remove outliers in my data by dropping the largest element within an index level.
import pandas as pd
index = pd.MultiIndex.from_product([['A','B'],range(3)],names=['Letters','Numbers'])
s = pd.Series([0,2,1,2,0,2], index=index)
s
Out:
Letters  Numbers
A        0          0
         1          2
         2          1
B        0          2
         1          0
         2          2
dtype: int64

My attempt (which does not work):

s.groupby('Letters').nlargest(-1)

Expected output:

Letters  Numbers
A        0          0
         2          1
B        1          0
         2          2
dtype: int64
Your approach works if you pass the group_keys=False parameter to Series.groupby and then use Series.drop with the resulting index values:

s = s.drop(s.groupby('Letters', group_keys=False).nlargest(1).index)
print (s)
Letters  Numbers
A        0          0
         2          1
B        1          0
         2          2
dtype: int64
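Why group_keys=False matters here: without it, groupby prepends the group key as an extra index level, so the labels returned by nlargest have three levels (Letters, Letters, Numbers) and would not match s.index when passed to drop:

s.groupby('Letters').nlargest(1).index                    # three levels - does not match s.index
s.groupby('Letters', group_keys=False).nlargest(1).index  # the original two-level labels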
You can use idxmax and drop:
s.drop(s.groupby('Letters').idxmax())
# or
# s.drop(s.groupby(level=0).idxmax())
Output:

Letters  Numbers
A        0          0
         2          1
B        1          0
         2          2
dtype: int64
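A note on ties: B has two equal maxima (at Numbers 0 and 2), and both approaches drop only the first one, since idxmax returns the label of the first occurrence and nlargest defaults to keep='first'. That is why (B, 2) survives in the output above.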
I have a DataFrame (dall) and a single-row DataFrame with the same columns (row).
How can I get d_result without writing a loop? I understand I can convert the DataFrames to NumPy arrays and broadcast, but I would imagine pandas has a way to do this directly. I have tried pd.mul, but it gives me NaN results.
dall = pd.DataFrame([[5,4,3], [3,5,5], [6,6,6]], columns=['a','b','c'])
row = pd.DataFrame([[-1, 100, 0]], columns=['a','b','c'])
d_result = pd.DataFrame([[-5,400,0], [-3,500,0], [-6,600,0]], columns=['a','b','c'])
dall
   a  b  c
0  5  4  3
1  3  5  5
2  6  6  6

row
   a    b  c
0 -1  100  0

d_result
   a    b  c
0 -5  400  0
1 -3  500  0
2 -6  600  0
We can use mul:

dall = dall.mul(row.loc[0], axis=1)
dall
Out[5]:
   a    b  c
0 -5  400  0
1 -3  500  0
2 -6  600  0
You can do this by multiplying the DataFrame by a Series, like this:

dall * row.iloc[0]

I think this is essentially the same as @WeNYoBen's answer.

You can also multiply a DataFrame by a DataFrame, as below. But be careful: NaN values will not propagate, because they are replaced with 1.0 before the multiplication.

dall.mul(row, axis='columns', fill_value=1.0)
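Concretely, because the two frames align on the row index first, only row 0 is actually multiplied here; rows 1 and 2 of dall meet the fill value 1.0 and come back unchanged, so this variant does not reproduce d_result:

dall.mul(row, axis='columns', fill_value=1.0)
#      a      b    c
# 0 -5.0  400.0  0.0
# 1  3.0    5.0  5.0
# 2  6.0    6.0  6.0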
I have a MultiIndexed DataFrame with three levels of indices. I would like to expand my third level to contain all values in a given range, but only for the existing values in the two upper levels.
For example, assume the first level is name, the second level is date and the third level is hour. I would like to have rows for all 24 possible hours (even if some are currently missing), but only for the already existing names and dates. The values in new rows can be filled with zeros.
So a simple example input would be:
>>> import pandas as pd
>>> df = pd.DataFrame([[1,1,1,3],[2,2,1,4], [3,3,2,5]], columns=['A', 'B', 'C','val'])
>>> df.set_index(['A', 'B', 'C'], inplace=True)
>>> df
       val
A B C
1 1 1    3
2 2 1    4
3 3 2    5
If the required values for C are [1, 2, 3], the desired output would be:

       val
A B C
1 1 1    3
    2    0
    3    0
2 2 1    4
    2    0
    3    0
3 3 1    0
    2    5
    3    0
I know how to achieve this using groupby and applying a defined function to each group, but I was wondering if there is a cleaner way of doing this with reindex (I couldn't make it work for the MultiIndex case, but perhaps I'm missing something).
Use:

partial_indices = [i[0:2] for i in df.index.values]
C_reqd = [1, 2, 3]
final_indices = [j + (i,) for j in partial_indices for i in C_reqd]
index = pd.MultiIndex.from_tuples(final_indices, names=['A', 'B', 'C'])
df2 = pd.DataFrame(0, index=index, columns=['val'])
df2.update(df)
Output:

df2
       val
A B C
1 1 1  3.0
    2  0.0
    3  0.0
2 2 1  4.0
    2  0.0
    3  0.0
3 3 1  0.0
    2  5.0
    3  0.0
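To answer the reindex part of the question: one alternative (a sketch under the same assumptions as above) is to build the full target index yourself from the existing (A, B) pairs crossed with the required C values, and then reindex with a fill value; this also keeps val as an integer column:

C_reqd = [1, 2, 3]

# unique (A, B) pairs already present, crossed with every required C value
full_index = pd.MultiIndex.from_tuples(
    [ab + (c,) for ab in df.index.droplevel('C').unique() for c in C_reqd],
    names=['A', 'B', 'C'])

df.reindex(full_index, fill_value=0)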