I have a dataframe like below:
import pandas as pd
import numpy as np
np.random.seed(22)
df = pd.DataFrame.from_dict({'a': np.random.rand(200), 'b': np.random.rand(200), 'x': np.tile(np.concatenate([np.repeat('F', 5), np.repeat('G', 5)]), 20)})
df.index = pd.MultiIndex.from_product([[1, 2], list(range(0, 10)), [1, 2, 3, 4, 5, 1, 2, 3, 4, 5]])
df.index.names = ['g_id', 'm_id', 'object_id']
I'd like to shift the values for the entire groups defined by ['g_id', 'm_id'] so that for example:
the value of ['a', 'b'] columns in index (1, 1, 1) of the new data frame would be the value from index (1, 0, 1) of the original dataframe, i.e. [0.208461, 0.980866]
the value of ['a', 'b'] columns in index (2, 4, 3) of the new data frame would be the value from index (2, 3, 3) of the original data frame, i.e. [0.651138, 0.559126].
The operation is similar to the one covered in this topic. However, I need to do this with multiple columns and I had no luck trying to generalise the provided solution.
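For reference, one way the generalisation could look, sketched under the assumption that the frame stays sorted by its index and every (g_id, m_id) block holds exactly 10 rows (as in the construction above): shift both columns by one full block within each g_id group.
# hedged sketch: shift whole (g_id, m_id) blocks down by one m_id
# assumes index order is preserved and each block is exactly 10 rows
block_size = 10
shifted = df.groupby(level='g_id')[['a', 'b']].shift(block_size)
The first m_id of every g_id comes out as NaN, just as a plain Series.shift would leave it, and shifted.loc[(1, 1, 1)] now carries the values from (1, 0, 1) of the original frame.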
usecols = [*range(1, 5), *range(7, 9), *range(11, 13)]
df = pd.read_csv('/content/drive/MyDrive/weather.csv', header=None, usecols=usecols, names=['d', 'm', 'y', 'time', 'temp1', 'outtemp', 'temp2', 'air_pressure', 'humidity'])
I'm trying this but always get
ValueError: Usecols do not match columns, columns expected but not found: [1, 2, 3, 4, 7, 8, 11, 12]
The problem you are seeing is due to a mismatch between the number of columns designated by usecols and the number of columns designated by names.
Usecols:
[1, 2, 3, 4, 7, 8, 11, 12] - 8 columns
names:
'd', 'm', 'y', 'time', 'temp1', 'outtemp', 'temp2', 'air_pressure', 'humidity' - 9 columns
Change the code so that the last range in usecols ends at 14 rather than 13:
Code:
usecols = [*range(1, 5), *range(7, 9), *range(11, 14)]
df = pd.read_csv('/content/drive/MyDrive/weather.csv', header=None, usecols=usecols, names=['d', 'm', 'y', 'time', 'temp1', 'outtemp', 'temp2', 'air_pressure', 'humidity'])
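As a quick sanity check (nothing specific to this CSV, just the expanded list), the corrected usecols now has nine entries, one per name:
usecols = [*range(1, 5), *range(7, 9), *range(11, 14)]
print(usecols)       # [1, 2, 3, 4, 7, 8, 11, 12, 13]
print(len(usecols))  # 9, matching the nine names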
Generally, I want to sort each cell of some columns in a pandas DataFrame based on one column's value; that single column stores the rank of the other columns' values.
Suppose I have a dataframe like this, where chrs has the characters I want to sort and rank_of_chr is the order of the characters for each row:
import pandas as pd
import numpy as np
import string
from operator import itemgetter
letters = list(string.ascii_lowercase)
np.random.seed(0)
# generate length for each row
data = pd.DataFrame({'col0': np.random.randint(2,10,10)})
# generate random string for each row
data['chrs'] = data.col0.apply(lambda x: ','.join(np.random.choice(letters) for i in range(x)))
# generate random rank for each row
data['rank_of_chr'] = data.col0.apply(lambda x: np.random.choice(x,x,replace = False))
data.iloc[:,1:]
chrs rank_of_chr
0 v,s,e,x,g,y [2, 3, 5, 1, 4, 0]
1 y,m,b,g,h,x,o,y,r [0, 4, 2, 3, 5, 6, 7, 1, 8]
2 f,z,n,i,j,u,t [4, 1, 5, 0, 6, 2, 3]
3 q,t [0, 1]
4 f,p,p,a,s [3, 0, 2, 1, 4]
5 d,y,r,t,t [1, 4, 2, 0, 3]
6 t,o,h,a,b [1, 2, 0, 3, 4]
7 j,z,a,k,u,x,d,l,s [7, 5, 1, 2, 3, 8, 6, 0, 4]
8 x,c,a [2, 0, 1]
9 a,e,v,f,g [0, 2, 3, 4, 1]
I want to sort the chrs value based on the rank_of_chr value for each row. For instance, for row 9, I want a,g,e,v,f (a,e,v,f,g with rank [0,2,3,4,1]; rank is ascending, just like rank() in SQL).
Since the true data is 50,000,000 rows, I want to find the fastest methods for it.
What I have tried:
iterate over the rows with itertuples, looping over each column I want to sort;
for each row, use np.argsort to get the sorted order of the rank, then use itemgetter to index the original values of chrs;
write the new values back in place with data.at[index, col_name] = new_value.
cols_need_sort = ['chrs']
for i in data.itertuples():
    this_order = np.argsort(list(map(int, data.loc[i.Index, 'rank_of_chr'])))
    for col_name in cols_need_sort:
        data.at[i.Index, col_name] = itemgetter(*this_order)(data.loc[i.Index, col_name].split(','))
data.iloc[:, 1:]
Any method to boost performance for this task?
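One hedged sketch of a faster approach: build the whole column in a single pass and skip the per-cell .at writes, which are slow at this scale. It keeps the comma-joined string format and assumes rank_of_chr holds integer arrays, as produced by the generator above:
def sort_row(chrs, ranks):
    # order the comma-separated characters by ascending rank
    parts = chrs.split(',')
    order = np.argsort(ranks)
    return ','.join(parts[j] for j in order)
data['chrs'] = [sort_row(c, r) for c, r in zip(data['chrs'], data['rank_of_chr'])]
Since every row is independent, the same row function could also be fed to multiprocessing for the 50,000,000-row case.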
I have two data frames: one made up of a column of numpy arrays, and the other with two columns. I am trying to match the elements in the first dataframe (df) against the index of df2 to get the two columns o1 and o2. I was wondering if I can get some input. Please note the string 'A1' in column 'o1' is repeated twice in df2, and as you may see in my desired output dataframe, the duplicates are removed in column o1.
import numpy as np
import pandas as pd
array_1 = np.array([[0, 2, 3], [3, 4, 6], [1, 2, 3, 6]], dtype=object)  # ragged rows need dtype=object on recent numpy
#dataframe 1
df = pd.DataFrame({ 'A': array_1})
#dataframe 2
df2 = pd.DataFrame({ 'o1': ['A1', 'B1', 'A1', 'C1', 'D1', 'E1', 'F1'], 'o2': [15, 17, 18, 19, 20, 7, 8]})
#desired output
df_output = pd.DataFrame({ 'A': array_1, 'o1': [['A1', 'C1'], ['C1', 'D1', 'F1'], ['B1','A1','C1','F1']],
'o2': [[15, 18, 19], [19, 20, 8], [17,18,19,8]] })
# please note: index 0 of df points at rows 0 and 2 of df2, which hold the same element 'A1'; the output keeps only one 'A1' by removing the duplicate
I believe you can explode df, use the result to extract the matching rows from df2, and finally join back to df:
s = df['A'].explode()
df_output= df.join(df2.loc[s].groupby(s.index).agg(lambda x: list(set(x))))
Output:
A o1 o2
0 [0, 2, 3] [C1, A1] [18, 19, 15]
1 [3, 4, 6] [F1, D1, C1] [8, 19, 20]
2 [1, 2, 3, 6] [F1, B1, C1, A1] [8, 17, 18, 19]
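Note that set() does not preserve order, which is why the lists above come out shuffled relative to the desired output. If first-seen order matters, a hedged variant of the same idea dedups with dict.fromkeys instead, which keeps insertion order:
s = df['A'].explode()
# dict.fromkeys drops duplicates without shuffling (insertion-ordered)
df_output = df.join(df2.loc[s].groupby(s.index).agg(lambda x: list(dict.fromkeys(x))))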
How can I use pd.cut() in Dask?
Because of the large dataset, I am not able to put the whole dataset into memory before finishing the pd.cut().
Current code that is working in Pandas but needs to be changed to Dask:
import pandas as pd
d = {'name': [1, 5, 1, 10, 5, 1], 'amount': [1, 5, 3, 8, 4, 1]}
df = pd.DataFrame(data=d)
#Groupby name and add column sum (of amounts) and count (number of grouped rows)
df = (df.groupby('name')['amount'].agg(['sum', 'count']).reset_index().sort_values(by='name', ascending=True))
print(df.head(15))
#Groupby bins and change sum and count based on the grouped rows
df = df.groupby(pd.cut(df['name'],
                       bins=[0,4,8,100],
                       labels=['namebin1', 'namebin2', 'namebin3']))[['sum', 'count']].sum().reset_index()
print(df.head(15))
Output:
name sum count
0 namebin1 5 3
1 namebin2 9 2
2 namebin3 8 1
I tried:
import pandas as pd
import dask.dataframe as dd
d = {'name': [1, 5, 1, 10, 5, 1], 'amount': [1, 5, 3, 8, 4, 1]}
df = pd.DataFrame(data=d)
df = dd.from_pandas(df, npartitions=2)
df = df.groupby('name')['amount'].agg(['sum', 'count']).reset_index()
print(df.head(15))
df = df.groupby(df.map_partitions(pd.cut,
df['name'],
bins=[0,4,8,100],
labels=['namebin1', 'namebin2', 'namebin3']))['sum', 'count'].sum().reset_index()
print(df.head(15))
Gives error:
TypeError("cut() got multiple values for argument 'bins'",)
The reason you're seeing this error is that df.map_partitions passes each partition of df as the first positional argument to pd.cut(), so df['name'] lands in the bins slot and then collides with the bins= keyword (see the docs).
You can wrap it in a custom function and call that instead, like so:
import pandas as pd
import dask.dataframe as dd
def custom_cut(partition, bins, labels):
    result = pd.cut(x=partition["name"], bins=bins, labels=labels)
    return result
d = {'name': [1, 5, 1, 10, 5, 1], 'amount': [1, 5, 3, 8, 4, 1]}
df = pd.DataFrame(data=d)
df = dd.from_pandas(df, npartitions=2)
df = df.groupby('name')['amount'].agg(['sum', 'count']).reset_index()
df = df.groupby(df.map_partitions(custom_cut,
bins=[0,4,8,100],
labels=['namebin1', 'namebin2', 'namebin3']))[['sum', 'count']].sum().reset_index()
df.compute()
       name  sum  count
0  namebin1    5      3
1  namebin2    9      2
2  namebin3    8      1
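As an aside, if dask's dtype inference on an empty partition ever gives trouble here, map_partitions also accepts an explicit meta describing the output; the ('name', 'category') tuple below is an assumption about the result's name and dtype:
binned = df.map_partitions(custom_cut,
                           bins=[0,4,8,100],
                           labels=['namebin1', 'namebin2', 'namebin3'],
                           meta=('name', 'category'))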
I am finding outliers in a column and storing them in a list. Now I want to delete from the column all the values which are present in my list.
How can I achieve this?
This is my function for finding outliers
outlier=[]
def detect_outliers(data):
    threshold=3
    m = np.mean(data)
    st = np.std(data)
    for i in data:
        #calculating z-score value
        z_score=(i-m)/st
        #if the z-score is greater than the threshold, it's an outlier
        if np.abs(z_score)>threshold:
            outlier.append(i)
    return outlier
This is my column in the data frame:
df_train_11.AMT_INCOME_TOTAL
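For the deletion step itself, a hedged sketch (df_train_11 and AMT_INCOME_TOTAL are the names from the question; isin builds a boolean mask of rows whose value appears in the outlier list):
outliers = detect_outliers(df_train_11.AMT_INCOME_TOTAL)
# keep only the rows whose value is NOT in the outlier list
df_train_11 = df_train_11[~df_train_11.AMT_INCOME_TOTAL.isin(outliers)]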
import numpy as np, pandas as pd
df = pd.DataFrame(np.random.rand(10,5))
outlier_list=[]
def detect_outliers(data):
    threshold=0.5
    for i in data:
        #calculating the z-score of each column (use the data argument, not the global df)
        z_score=(data.loc[:,i]-np.mean(data.loc[:,i]))/np.std(data.loc[:,i])
        outliers = np.abs(z_score)>threshold
        outlier_list.append(data.index[outliers].tolist())
    return outlier_list
outlier_list = detect_outliers(df)
[[1, 2, 4, 5, 6, 7, 9],
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9],
[0, 1, 2, 4, 8],
[0, 1, 3, 4, 6, 8],
[0, 1, 3, 5, 6, 8, 9]]
This way, you get the outliers of each column: outlier_list[0] gives you [1, 2, 4, 5, 6, 7, 9], which means that rows 1, 2, etc. are outliers for column 0.
EDIT
Shorter answer:
df = pd.DataFrame(np.random.randn(10, 3), columns=list('ABC'))
df[((df.B - df.B.mean()) / df.B.std()).abs() < 3]
This will filter the DataFrame, keeping only the rows where ONE column (e.g. 'B') is within three standard deviations of its mean.
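To apply the same rule to every column at once, a hedged sketch (assumes an all-numeric frame, like the random example above):
# per-cell z-scores, then keep only rows where every column stays below 3
df_clean = df[(((df - df.mean()) / df.std()).abs() < 3).all(axis=1)]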