how do I convert multiple columns into a list in ascending order? - pandas

I have a dataset of this type:
      0      1       2
0   0:0   57:0   166:0
1   0:5  57:20  166:27
2  0:10  57:8:  166:36
3  0:27   57:4  166:45
I want to convert this dataframe into a list in ascending order: take the whole dataset, sort it ascending, and build a list from it. The ascending order should be based on the numbers before the ':'.
desired output:
list
0
57
166
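For reference, the sample data above could be reproduced with something like this (assuming the values are stored as strings):
import pandas as pd

df = pd.DataFrame({
    0: ['0:0', '0:5', '0:10', '0:27'],
    1: ['57:0', '57:20', '57:8:', '57:4'],
    2: ['166:0', '166:27', '166:36', '166:45'],
})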

You can unstack (or stack) to flatten to a Series, then extract the number, convert it to integer, and keep the unique values in order:
For a python list you can try:
sorted(df.unstack()
         .str.extract(r'(\d+):', expand=False)
         .astype(int)
         .unique().tolist())
output: [0, 57, 166]
As Series:
out = (df.unstack()
         .str.extract(r'(\d+):', expand=False)
         .astype(int)
         .drop_duplicates()
         .sort_values().reset_index(drop=True)
       )
output:
0 0
1 57
2 166
dtype: int64
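If you prefer to avoid the regex, splitting on ':' gives the same result (a sketch, assuming the same df):
sorted(df.stack().str.split(':').str[0].astype(int).unique().tolist())
# [0, 57, 166]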

Related

Need to sort the pivot table based on the columns passed in the index attribute. It's a MultiIndex

I can't sort the pivot table based on the columns passed in the index attribute in ascending order.
When the df is printed, 'Deepthy' comes first for the Name column; I need 'aarathy' to come first.
df = pd.DataFrame({'Name': ['aarathy', 'Deepthy', 'aarathy', 'aarathy'],
                   'Ship': ['everest', 'Oasis of the Seas', 'everest', 'everest'],
                   'Tracking': ['TESTTRACK003', 'TESTTRACK008', 'TESTTRACK009', 'TESTTRACK005'],
                   'Bag': ['123', '127', '129', '121']})
df = pd.pivot_table(df, index=["Name", "Ship", "Tracking", "Bag"]).sort_index(axis=1, ascending=True)
I tried passing sort_values and sort_index(axis=1, ascending=True) but it doesn't work.
You need to convert the values to lowercase for the first level of sorting, using the key parameter:
# helper column so pivot_table has something to aggregate
df['new'] = 1
df = (pd.pivot_table(df, index=["Name", "Ship", "Tracking", "Bag"])
        .sort_index(level=0, ascending=True, key=lambda x: x.str.lower()))
print(df)
new
Name Ship Tracking Bag
aarathy everest TESTTRACK003 123 1
TESTTRACK005 121 1
TESTTRACK009 129 1
Deepthy Oasis of the Seas TESTTRACK008 127 1
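If you only need the rows of the original (un-pivoted) frame ordered case-insensitively, the same key trick works with sort_values, for example:
df_sorted = df.sort_values('Name', key=lambda s: s.str.lower())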

Reshape a DataFrame based on column value, and pad missing slices with zeros

I have a Pandas DataFrame which looks like:
ID  order  other_column_1  other_column_x
A   0      10              20
A   1      11              21
A   2      12              22
B   0      31              41
B   2      33              43
I want to reshape it to a 3D matrix with shape (#IDs, #order, #other columns). For the example above, it should be of shape (2, 3, 2).
The order column holds the order of the 2nd dimension, so slice ['A', 0, :] should be [10, 20], ['A', 1, :] should be [11, 21], and so on. The values of order are identical for all IDs (0, 1, 2 in this case).
Trouble is, sometimes a slice is missing, e.g. for 'B' the slice with order '1' is missing; I want to pad such missing slices with all 0's to keep the shape consistent.
I thought of pre-sorting the whole DataFrame by ID and order, looping over each ID, inserting the missing slices, and stacking them together. However, the DataFrame is huge, so I'd like to avoid a global sort and loop if possible.
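For reference, the loop-based approach described above might look roughly like this (a sketch only, assuming the column names shown; the answers below avoid the explicit loop):
import numpy as np

# all expected order values, e.g. [0, 1, 2]
orders = np.sort(df['order'].unique())

slices = []
for _, g in df.sort_values(['ID', 'order']).groupby('ID'):
    # reindex each ID's block on the full order range, padding missing slices with zeros
    block = (g.set_index('order')
              .reindex(orders, fill_value=0)
              .drop(columns='ID')
              .to_numpy())
    slices.append(block)

arr = np.stack(slices)  # shape: (#IDs, #orders, #other columns)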
I came up with a way to do it (if you have enough PC memory to allocate) where you don't have to loop over the whole dataframe, although I couldn't test it with 10M rows because of memory allocation. I tested it with 5M rows by 300 columns and I show the results at the end of the answer.
The idea is to get all the combinations of the unique values of the first 2 columns as an index to build the first 2 dimensions of the 3D array.
After that you can merge the original dataframe with the dataframe containing the index combinations, and then fill all the missing values with 0.
Once the data is complete you can pass it to numpy and reshape it in 3 dimensions.
Code without comments:
# df = original dataframe
d1 = df.ID.unique()
d2 = df.order.unique()
df3 = pd.MultiIndex.from_product((d1, d2), names=['ID', 'order'])\
        .to_frame().reset_index(drop=True)\
        .merge(df, on=['ID', 'order'], how='left')\
        .fillna(0)
np_3d_array = df3[df3.columns[2:]].to_numpy().reshape(d1.shape[0], d2.shape[0], df.columns[2:].shape[0])
Code with comments:
# df = original dataframe
# Get unique ID values for the 1st dimension
d1 = df.ID.unique()
# Get unique order values for the 2nd dimension
d2 = df.order.unique()
# Get complete DF
df3 = (pd.MultiIndex.from_product((d1, d2), names=['ID', 'order'])  # all ID/order combinations as an index
         .to_frame().reset_index(drop=True)                         # DataFrame from the MultiIndex, reset index
         .merge(df, on=['ID', 'order'], how='left')                 # merge the complete combinations with the original values
         .fillna(0))                                                # fill missing values with 0
# get the complete data as a 2D array and reshape to a 3D array
np_3d_array = df3[df3.columns[2:]].to_numpy().reshape(d1.shape[0], d2.shape[0], df.columns[2:].shape[0])
Test:
First I tried to test with 10M rows but I could not allocate the memory needed for that.
To test the code I created a dataframe with 6M rows x 300 columns (random float numbers) and dropped 1M rows to simulate the missing values.
Here is the code I used to test and the results.
Test code:
import random
import time
import pandas as pd
import numpy as np
# 100000 different IDs and 60 different order values
df_test = (pd.MultiIndex.from_product((range(100000), range(60)), names=['ID', 'order'])
             .to_frame().reset_index(drop=True)
             .drop(random.sample(range(6_000_000), k=1_000_000))  # drop 1M rows to simulate missing rows
             .reset_index(drop=True))
# 5M rows of random data by 298 columns
df_test2 = pd.DataFrame(np.random.random(size=(5_000_000, 298)))
df = df_test.merge(df_test2, left_index=True, right_index=True)
start = time.time()
d1 = df.ID.unique()
print(f'time 1st Dimension: {round(time.time()-start, 3)}')
d2 = df.order.unique()
print(f'time 2nd Dimension: {round(time.time()-start, 3)}')
df3 = pd.MultiIndex.from_product((d1, d2), names=['ID', 'order'])\
.to_frame().reset_index(drop=True)\
.merge(df, on=['ID', 'order'], how='left').fillna(0)
print(f'time merge: {round(time.time()-start, 3)}')
np_3d_array = df3[df3.columns[2:]].to_numpy().reshape(d1.shape[0], d2.shape[0], df.columns[2:].shape[0])
print(f'time ndarray: {round(time.time()-start, 3)}')
print(f'array shape: {np_3d_array.shape}')
print(f'array type: {type(np_3d_array)}')
Test Results:
time 1st Dimension: 0.035
time 2nd Dimension: 0.063
time merge: 47.202
time ndarray: 49.441
array shape: (100000, 60, 298)
array type: <class 'numpy.ndarray'>
An alternative is to set ID and order as the index, then reindex on the full product of their unique values, filling the missing slices with 0:
ids = df.ID.unique()
orders = df.order.unique()
ar = (df.set_index(['ID', 'order'])
        .reindex(pd.MultiIndex.from_product((ids, orders)))
        .fillna(0)
        .to_numpy()
        .reshape(len(ids), len(orders), len(df.columns[2:])))
print(ar)
print(ar.shape)
Output:
[[[10. 20.]
[11. 21.]
[12. 22.]]
[[31. 41.]
[ 0. 0.]
[33. 43.]]]
(2, 3, 2)

Create dictionary from pandas dataframe

I have a pandas dataframe with data as such:
[image: pandas dataframe]
From this I need to create a dictionary where the key is Customer_ID and the value is an array of tuples (Feat_ID, Feat_value).
I'm getting close using the to_dict() function on the dataframe.
Thanks
You should first set Customer_ID as the DataFrame index and use df.to_dict with orient='index' to obtain a dict in the form {index -> {column -> value}} (see the documentation). Then you can extract the values of the inner dictionaries using a dict comprehension to obtain the tuples.
df_dict = {key: tuple(value.values())
           for key, value in df.set_index('Customer_ID').to_dict('index').items()}
Use a comprehension:
out = {customer: [tuple(l) for l in subdf.to_dict('split')['data']]
       for customer, subdf in df.groupby('Customer_ID')[['Feat_ID', 'Feat_value']]}
print(out)
# Output
{80: [(123, 0), (124, 0), (125, 0), (126, 0), (127, 0)]}
Input dataframe:
>>> df
Customer_ID Feat_ID Feat_value
0 80 123 0
1 80 124 0
2 80 125 0
3 80 126 0
4 80 127 0
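An equivalent, arguably simpler comprehension uses zip over the two columns (a sketch assuming the same column names):
out = {cust: list(zip(g['Feat_ID'], g['Feat_value']))
       for cust, g in df.groupby('Customer_ID')}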

Groupby two columns, one of which is datetime

I have a data frame that I want to group by two columns, one of them of datetime type. How can I do this?
import numpy as np
import pandas as pd
import datetime as dt

df = pd.DataFrame({
    'a': np.random.randn(6),
    'b': np.random.choice([5, 7, np.nan], 6),
    'c': np.random.choice(['panda', 'python', 'shark'], 6),
    # some ways to create systematic groups for indexing or groupby
    # this is similar to r's expand.grid(), see note 2 below
    'd': np.repeat(range(3), 2),
    'e': np.tile(range(2), 3),
    # a date range and set of random dates
    'f': pd.date_range('1/1/2011', periods=6, freq='D'),
    'g': np.random.choice(pd.date_range('1/1/2011', periods=365,
                                        freq='D'), 6, replace=False)
})
You can use pd.Grouper to specify groupby instructions. It can be used with a pd.DatetimeIndex to group the data with a specified frequency via the freq parameter.
Assuming that you have this dataframe:
df = pd.DataFrame(dict(
    a=dict(date=pd.Timestamp('2020-05-01'), category='a', value=1),
    b=dict(date=pd.Timestamp('2020-06-01'), category='a', value=2),
    c=dict(date=pd.Timestamp('2020-06-01'), category='b', value=6),
    d=dict(date=pd.Timestamp('2020-07-01'), category='a', value=1),
    e=dict(date=pd.Timestamp('2020-07-27'), category='a', value=3),
)).T
You can set the index to the date column, and it will be converted to a pd.DatetimeIndex. Then you can use pd.Grouper along with other columns; in the following example I use the category column.
The freq='M' parameter groups the index with month-end frequency. There are a number of string offset aliases that can be used with pd.Grouper.
df.set_index('date').groupby([pd.Grouper(freq='M'), 'category'])['value'].sum()
Result:
date category
2020-05-31 a 1
2020-06-30 a 2
b 6
2020-07-31 a 4
Name: value, dtype: int64
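If you prefer not to set the index, pd.Grouper also accepts a key argument naming the datetime column; a minimal sketch with the same dataframe (assuming the date column is first converted to a datetime dtype):
df['date'] = pd.to_datetime(df['date'])
df.groupby([pd.Grouper(key='date', freq='M'), 'category'])['value'].sum()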
Another example with your MCVE:
df.set_index('g').groupby([pd.Grouper(freq='M'), 'c']).d.sum()
Result:
g c
2011-01-31 panda 0
2011-04-30 shark 2
2011-06-30 panda 2
2011-07-31 panda 0
2011-09-30 panda 1
2011-12-31 python 1
Name: d, dtype: int32

Pandas dataframe groupby causes dropped columns

I have a pandas dataframe which I need to group by a text column to obtain the sum of duplicated values along that column. But when I run the groupby method it mysteriously drops many columns. Can anyone help me with this?
Check your column dtypes; sum only works for numeric values.
For example, say you have a df as below:
df=pd.DataFrame({'V1':[1,2,3],'V2':['A','B','C'],'KEY':[1,2,2]})
df.dtypes
Out[159]:
KEY int64
V1 int64
V2 object
dtype: object
Then when you groupby KEY and sum the whole dataframe, it will only return the result for the numeric columns:
df.groupby('KEY').sum()
Out[160]:
V1
KEY
1 1
2 5
If you need the string columns to be joined together as well, you can do:
df.groupby('KEY',as_index=False).apply(lambda x : x.sum())
Out[164]:
KEY V1 V2
0 1 1 A
1 4 5 BC
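A variant of the same idea, if you'd rather state explicitly how each column should be aggregated (a sketch assuming the same df):
out = df.groupby('KEY', as_index=False).agg({'V1': 'sum', 'V2': ''.join})
print(out)
#    KEY  V1  V2
# 0    1   1   A
# 1    2   5  BC
Unlike the apply above, this keeps the KEY values themselves unsummed.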