Check sorting by year and quarter in a pandas DataFrame

I have a df that looks like below
date col1 col2
0 2000 Q1 123 456
1 2000 Q2 234 567
2 2000 Q3 345 678
3 2000 Q4 456 789
4 2001 Q1 567 890
The df has over 200 rows. I need to -
check if the data is sorted by date
if not, then sort it by date
Can someone please help me with this?
Many thanks

Use DataFrame.sort_values with the key parameter, converting the values to datetimes:
df = df.sort_values('date', key=lambda x: pd.to_datetime(x.str.replace(r'\s+', '', regex=True)))
print (df)
date col1 col2
0 2000 Q1 123 456
1 2000 Q2 234 567
2 2000 Q3 345 678
3 2000 Q4 456 789
4 2001 Q1 567 890
EDIT: You can use Series.is_monotonic_increasing (Series.is_monotonic in older pandas versions) to test whether the values are already in increasing order:
if not df['date'].is_monotonic_increasing:
    df = df.sort_values('date', key=lambda x: pd.to_datetime(x.str.replace(r'\s+', '', regex=True)))

You can convert your date column to a pd.Index (or define it as the index of your dataframe):
if not pd.Index(df['date']).is_monotonic_increasing:
    df = df.sort_values('date')
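As a minimal sketch combining both ideas (not part of either answer, and assuming the date column always follows the 'YYYY QN' pattern), the quarters can be parsed with pd.PeriodIndex and the check done with is_monotonic_increasing, which replaces is_monotonic removed in pandas 2.0:
import pandas as pd

df = pd.DataFrame({'date': ['2000 Q1', '2000 Q3', '2000 Q2'],
                   'col1': [123, 345, 234]})

# parse the 'YYYY QN' strings into quarterly periods (assumes one space between year and quarter)
quarters = pd.PeriodIndex(df['date'].str.replace(' ', ''), freq='Q')

# only sort when the parsed quarters are out of order
if not quarters.is_monotonic_increasing:
    df = df.iloc[quarters.argsort()].reset_index(drop=True)

print (df)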

Related

Duplicated IDs pandas

I have the following dataframes (df1, df2):
 ID  Q1
111   2
111   3
112   1

 ID  Q2
111   1
111   5
112   7
Since the IDs are duplicated, I want to reinitialize them, using the following code:
df1.sort_values('ID',inplace=True)
df1['ID_new'] = range(len(df1))
df2.sort_values('ID',inplace=True)
df2['ID_new'] = range(len(df2))
in order to have something like this:
ID_new   ID  Q1
     0  111   2
     1  111   3
     2  112   1

ID_new   ID  Q2
     0  111   1
     1  111   5
     2  112   7
The question is: are we sure that the ID_new will be the same for df1 and df2?
For example:
is it possible that ID_new = 1 corresponds to the first ID=111 in df1 and to the second ID = 111 in df2?
If so, is there a more robust way to reinitialize them?
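No answer was posted, but here is a hedged sketch (the renumber helper is made up for illustration). With the default quicksort the order of tied IDs is indeed not guaranteed, so ID_new = 1 could land on different rows in df1 and df2; a stable sort keeps duplicated IDs in their original order, which makes the numbering deterministic as long as both frames list the duplicates in the same relative order:
import pandas as pd

df1 = pd.DataFrame({'ID': [111, 111, 112], 'Q1': [2, 3, 1]})
df2 = pd.DataFrame({'ID': [111, 111, 112], 'Q2': [1, 5, 7]})

def renumber(df):
    # kind='stable' preserves the original order of duplicated IDs,
    # so the i-th occurrence of an ID gets the same ID_new in every frame
    out = df.sort_values('ID', kind='stable').reset_index(drop=True)
    out['ID_new'] = range(len(out))
    return out

df1_new = renumber(df1)
df2_new = renumber(df2)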

Get the max, min value and the respective owner of that max, min price

I have this df:
df = pd.DataFrame({'code_1':[123,456,789,456,123],
'code_2':[333,666,999,666,333],
'seller':['Andres','Mike','Paul','Andy','Mike'],
'price':[12.2,23.51,12.34,10.2,13.0]})
and I want to get this as output:
dff = pd.DataFrame({'code_1':[123,456,789],
'code_2':[333,666,999],
'seller_maxprice':['Mike','Mike','Paul'],
'max_price':[13,23.51,12.34],
'avg_price':[12.6,16.85,12.34],
'seller_minprice':['Andres','Andy','Paul'],
'min_price':[12.2,10.20,12.34]})
So, from df I want to get the max price, min price, avg price, seller of the max price and seller of the min price per code_1 and/or code_2.
And there's one more issue: I don't always have both unique identifiers, code_1 and code_2. Sometimes code_1 is present and code_2 is null, and vice versa.
I tried:
result = df.groupby(['code_1']).agg({'price': ['mean', 'min', 'max']})
But this approach has two issues:
1: it will fail when code_1 is null
2: it does not show the sellers of the max and min price
Can anyone help me with this?
Your approach is the right one, we can improve it in a number of small ways:
If you set seller as the index, then your seller min/max price columns become idxmin and idxmax (it's OK for the index to be non-unique)
If you expect code_1 to be null, you can use fillna to add some placeholder values
If you use ['price'] after .groupby you’ll get a SeriesGroupBy instead of a DataFrameGroupBy which means you don’t have to specify which columns to aggregate in .agg()
You can define the column names as keys in .agg()
>>> df.set_index('seller').fillna({'code_1': 'missing'})\
... .groupby(['code_1', 'code_2'])['price']\
... .agg(maxprice='max', minprice='min', aver_price='mean',
... seller_minprice='idxmin', seller_maxprice='idxmax',
... ).reset_index()
...
code_1 code_2 maxprice minprice aver_price seller_minprice seller_maxprice
0 123 333 13.00 12.20 12.600 Andres Mike
1 456 666 23.51 10.20 16.855 Andy Mike
2 789 999 12.34 12.34 12.340 Paul Paul
You can do that with groupby and apply
dff = df.groupby(['code_1','code_2']).apply(lambda x : pd.Series({'seller_maxprice':x.loc[x['price'].idxmax(),'seller'],
'max_price':x['price'].max(),
'seller_minprice': x.loc[x['price'].idxmin(), 'seller'],
'min_price': x['price'].min(),
'aver_price':x['price'].mean()
})).reset_index()
Out[22]:
code_1 code_2 seller_maxprice ... seller_minprice min_price aver_price
0 123 333 Mike ... Andres 12.20 12.600
1 456 666 Mike ... Andy 10.20 16.855
2 789 999 Paul ... Paul 12.34 12.340
[3 rows x 7 columns]
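Neither answer handles the case where only one of code_1 / code_2 is filled in. Here is a hedged sketch of one way to deal with it (the lookup and key names are made up for illustration, and it assumes every code_2 appears next to its code_1 in at least one row): map code_2 back to code_1 where possible, then group on the resulting key:
import numpy as np
import pandas as pd

df = pd.DataFrame({'code_1': [123, 123, np.nan, 456],
                   'code_2': [333, np.nan, 333, 666],
                   'seller': ['Andres', 'Mike', 'Paul', 'Andy'],
                   'price': [12.2, 13.0, 11.5, 10.2]})

# build a code_2 -> code_1 lookup from rows where both codes are present
lookup = (df.dropna(subset=['code_1', 'code_2'])
            .drop_duplicates('code_2')
            .set_index('code_2')['code_1'])
# fill missing code_1 values through the lookup to get one grouping key
df['key'] = df['code_1'].fillna(df['code_2'].map(lookup))

out = (df.set_index('seller')
         .groupby('key')['price']
         .agg(max_price='max', min_price='min', avg_price='mean',
              seller_minprice='idxmin', seller_maxprice='idxmax')
         .reset_index())
print (out)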

Convert a tab- and newline-delimited string to pandas dataframe

I have a string of the following format:
aString = '123\t456\t789\n321\t654\t987 ...'
And I would like to convert it to a pandas DataFrame
frame:
123 456 789
321 654 987
...
I have tried to convert it to a Python list:
stringList = aString.split('\n')
which results in:
stringList = ['123\t456\t789',
'321\t654\t987',
...
]
I have no idea what to do next.
One option is a list comprehension with str.split:
pd.DataFrame([x.split('\t') for x in stringList], columns=list('ABC'))
A B C
0 123 456 789
1 321 654 987
You can use StringIO
from io import StringIO
pd.read_csv(StringIO(aString), sep='\t', header=None)
0 1 2
0 123 456 789
1 321 654 987
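One follow-up worth noting (not from either answer, just a hedged observation): the two approaches produce different dtypes, so a cast may be needed if numeric columns are expected.
from io import StringIO
import pandas as pd

aString = '123\t456\t789\n321\t654\t987'

# read_csv infers integer columns from the text
df_csv = pd.read_csv(StringIO(aString), sep='\t', header=None)
print (df_csv.dtypes)   # int64

# splitting by hand leaves everything as strings, so cast explicitly if numbers are wanted
df_split = pd.DataFrame([x.split('\t') for x in aString.split('\n')],
                        columns=list('ABC')).astype(int)
print (df_split.dtypes)  # int64 after the cast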

Merge two DataFrames on multiple columns

Hope you can help me. I have two pretty big datasets.
DF1 Example:
id  A_Workflow_Type_ID  B_Workflow_Type_ID  ...
 1                 123                 456  ...
 2                 789                 222  ...
 3                 333                NULL  ...
DF2 Example:
Workflow  Operation  Profile  Type       Name  ...
     123          1        2  Low_Cost   xyz   ...
     456          2        5  High_Cost  z     ...
I need to merge the two datasets without creating many NaNs and extra columns, so I want to merge A_Workflow_Type_ID and B_Workflow_Type_ID from DF1 onto Workflow from DF2.
I tried several join operations and the merge option in pandas, but they failed.
My last try:
all_Data = pd.merge(left=DF1,right=DF2, how='inner', left_on =['A_Workflow_Type_ID ','B_Workflow_Type_ID '], right_on=['Workflow'])
But that returns an error saying that left_on and right_on have to be of equal length.
Thanks for the help!
You need to reshape first with melt and then merge:
# select all columns whose names do not contain 'Workflow'
cols = DF1.columns[~DF1.columns.str.contains('Workflow')]
print (cols)
Index(['id'], dtype='object')
df = DF1.melt(cols, value_name='Workflow', var_name='type')
print (df)
id type Workflow
0 1 A_Workflow_Type_ID 123.0
1 2 A_Workflow_Type_ID 789.0
2 3 A_Workflow_Type_ID 333.0
3 1 B_Workflow_Type_ID 456.0
4 2 B_Workflow_Type_ID 222.0
5 3 B_Workflow_Type_ID NaN
all_Data = pd.merge(left=df,right=DF2, on ='Workflow')
print (all_Data)
id type Workflow Operation Profile Type Name
0 1 A_Workflow_Type_ID 123 1 2 Low_Cost xyz
1 1 B_Workflow_Type_ID 456 2 5 High_Cost z
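A hedged follow-up to the answer above: the inner merge silently drops id 3, whose B_Workflow_Type_ID is NULL and therefore has no match in DF2. If those rows should be kept, a left merge leaves them in place with NaN in the DF2 columns:
# same reshaped frame as above; how='left' keeps unmatched Workflow rows
all_Data = pd.merge(left=df, right=DF2, on='Workflow', how='left')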

How to count the no. of same strings in pandas dataframe

I have a dataframe like:
Company Date Country
ABC 2017-09-17 USA
BCD 2017-09-16 USA
ABC 2017-09-17 USA
BCD 2017-09-16 USA
BCD 2017-09-16 USA
I want to get a resultant df as:
Company No: of Days
ABC 2
BCD 3
How do I do it?
You can use value_counts and rename_axis with reset_index:
df1 = (df['Company'].value_counts()
         .rename_axis('Company').reset_index(name='No: of Companies'))
print (df1)
Company No: of Companies
0 BCD 3
1 ABC 2
Another solution with groupby and aggregating size, then reset_index:
df1 = df.groupby('Company').size().reset_index(name='No: of Companies')
print (df1)
Company No: of Companies
0 BCD 3
1 ABC 2
If you need to count the Date column:
df1 = df['Date'].value_counts().rename_axis('Date').reset_index(name='No: of Days')
print (df1)
Date No: of Days
0 2017-09-16 3
1 2017-09-17 2
df1 = df.groupby('Date').size().reset_index(name='No: of Days')
print (df1)
Date No: of Days
0 2017-09-16 3
1 2017-09-17 2
EDIT:
If you need to count pairs of Date and Company columns:
df1 = df.groupby(['Date', 'Company']).size().reset_index(name='No: of Days per company')
print (df1)
Date Company No: of Days per company
0 2017-09-16 BCD 3
1 2017-09-17 ABC 2
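As a side note not in the original answer: if pandas >= 1.1 is available, DataFrame.value_counts can count the pairs directly, a hedged equivalent of the groupby above:
import pandas as pd

df = pd.DataFrame({'Company': ['ABC', 'BCD', 'ABC', 'BCD', 'BCD'],
                   'Date': ['2017-09-17', '2017-09-16', '2017-09-17',
                            '2017-09-16', '2017-09-16'],
                   'Country': ['USA'] * 5})

# count (Date, Company) pairs in one call; requires pandas >= 1.1
df1 = (df.value_counts(['Date', 'Company'])
         .reset_index(name='No: of Days per company'))
print (df1)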