check sorting by year and quarter pandas dataframe

check sorting by year and quarter pandas dataframe - pandas

I have a df that looks like below
date col1 col2
0 2000 Q1 123 456
1 2000 Q2 234 567
2 2000 Q3 345 678
3 2000 Q4 456 789
4 2001 Q1 567 890
The df has over 200 rows. I need to -
check if the data is sorted by date
if not, then sort it by date
Can someone please help me with this?
Many thanks

Use DataFrame.sort_values with key parameter and converting values to datetimes:
df = df.sort_values('date', key=lambda x: pd.to_datetime(x.str.replace('\s+', '')))
print (df)
date col1 col2
0 2000 Q1 123 456
1 2000 Q2 234 567
2 2000 Q3 345 678
3 2000 Q4 456 789
4 2001 Q1 567 890
EDIT: You can use Series.is_monotonic for test if values are monotonic_increasing:
if not df['date'].is_monotonic:
df = df.sort_values('date', key=lambda x: pd.to_datetime(x.str.replace('\s+', '')))

You can convert your date column as pd.Index (or define it as the index of your dataframe):
if not pd.Index(df['date']).is_monotonic_increasing:
df = df.sort_values('date')

Related

Duplicated IDs pandas

I have the following dataframes (df1,df2):
ID
Q1
111
2
111
3
112
1
ID
Q2
111
1
111
5
112
7
Since the IDs are duplicated, I want to reinitialize them, using the following code:
df1.sort_values('ID',inplace=True)
df1['ID_new'] = range(len(df1))
df2.sort_values('ID',inplace=True)
df2['ID_new'] = range(len(df2))
in order to have smth like this:
ID_new
ID
Q1
0
111
2
1
111
3
2
112
1
ID_new
ID
Q2
0
111
1
1
111
5
2
112
7
The question is: are we sure that the ID_new will be the same for df1 and df2?
For example:
is it possible that ID_new = 1 corresponds to the first ID=111 in df1 and to the second ID = 111 in df2?
If yes, there is another way to reinitialize it in a more robust way?

Get the max, min value and the respective owner of that max, min price - Pandas

I have this df:
df = pd.DataFrame({'code_1':[123,456,789,456,123],
'code_2':[333,666,999,666,333],
'seller':['Andres','Mike','Paul','Andy','Mike'],
'price':[12.2,23.51,12.34,10.2,13.0]})
and I want to get this as output:
dff = pd.DataFrame({'code_1':[123,456,789],
'code_2':[333,666,999],
'seller_maxprice':['Mike','Mike','Paul'],
'max_price':[13,23.51,12.34],
'avg_price':[12.6,16.85,12.34],
'seller_mainprice':['Andres','Andy','Paul'],
'min_price':[12.2,10.20,12.34]})
So, from DF I want to get the maxprice, minprice, avgprice, sellermaxprice and sellerminprice of cod_1 and/or cod_2
And there´s one more issue: I dont have both unique indentifier: cod_1 and cod_2 all the time. Sometimes I Have cod_1 and cod_2 is null and vice-versa
I tried:
result = df.groupby(['code_1']).agg({'price': ['mean', 'min', 'max']})
But this approach has two issues:
1: will fail when code_1 is null
2: Does not shoe the seller max and seller min_price
an anyone help me with this?

Your approach is the right one, we can improve it in a number of small ways:
If you set seller as the index, then your seller min/maxprice become idxmin and idxmax − it’s OK for the index to be non-unique
If you expect code_1 to be null, you can use fillna to add some placeholder values
If you use ['price'] after .groupby you’ll get a SeriesGroupBy instead of a DataFrameGroupBy which means you don’t have to specify which columns to aggregate in .agg()
You can define the column names as keys in .agg()
>>> df.set_index('seller').fillna({'code_1': 'missing'})\
... .groupby(['code_1', 'code_2'])['price']\
... .agg(maxprice='max', minprice='min', aver_price='mean',
... seller_minprice='idxmin', seller_maxprice='idxmax',
... ).reset_index()
...
code_1 code_2 maxprice minprice aver_price seller_minprice seller_maxprice
0 123 333 13.00 12.20 12.600 Andres Mike
1 456 666 23.51 10.20 16.855 Andy Mike
2 789 999 12.34 12.34 12.340 Paul Paul

You can do that with groupby and apply
dff = df.groupby(['code_1','code_2']).apply(lambda x : pd.Series({'seller_maxprice':x.loc[x['price'].idxmax(),'seller'],
'max_price':x['price'].max(),
'seller_minprice': x.loc[x['price'].idxmin(), 'seller'],
'min_price': x['price'].min(),
'aver_price':x['price'].mean()
})).reset_index()
Out[22]:
code_1 code_2 seller_maxprice ... seller_minprice min_price aver_price
0 123 333 Mike ... Andres 12.20 12.600
1 456 666 Mike ... Andy 10.20 16.855
2 789 999 Paul ... Paul 12.34 12.340
[3 rows x 7 columns]

Convert a tab- and newline-delimited string to pandas dataframe

I have a string of the following format:
aString = '123\t456\t789\n321\t654\t987 ...'
And I would like to convert it to a pandas DataFrame
frame:
123 456 789
321 654 987
...
I have tried to convert it to a Python list:
stringList = aString.split('\n')
which results in:
stringList = ['123\t456\t789',
'321\t654\t987',
...
]
Have no idea what to do next.

one option is list comprehension with str.split
pd.DataFrame([x.split('\t') for x in stringList], columns=list('ABC'))
A B C
0 123 456 789
1 321 654 987
You can use StringIO
from io import StringIO
pd.read_csv(StringIO(aString), sep='\t', header=None)
0 1 2
0 123 456 789
1 321 654 987

Merge two DataFrames on multiple columns

hope you can help me.
I have two pretty big Datasets.
DF1 Example:
|id| A_Workflow_Type_ID | B_Workflow_Type_ID | ...
1 123 456
2 789 222 ...
3 333 NULL ...
DF2 Example:
Workflow| Operation | Profile | Type | Name | ...
123 1 2 Low_Cost xyz ...
456 2 5 High_Cost z ...
I need to merge the two datasets without creating many NaNs and multiple columns. So i merge on the informations A_Workflow_Type_ID and B_Workflow_Type_ID from DF1 on Workflow from DF2.
I tried it with several join operations in pandas and the merge option it failure.
My last try:
all_Data = pd.merge(left=DF1,right=DF2, how='inner', left_on =['A_Workflow_Type_ID ','B_Workflow_Type_ID '], right_on=['Workflow'])
But that returns an error that they have to be equal lenght on both sides.
Thanks for the help!

You need reshape first by melt and then merge:
#generate all column without strings Workflow
cols = DF1.columns[~DF1.columns.str.contains('Workflow')]
print (cols)
Index(['id'], dtype='object')
df = DF1.melt(cols, value_name='Workflow', var_name='type')
print (df)
id type Workflow
0 1 A_Workflow_Type_ID 123.0
1 2 A_Workflow_Type_ID 789.0
2 3 A_Workflow_Type_ID 333.0
3 1 B_Workflow_Type_ID 456.0
4 2 B_Workflow_Type_ID 222.0
5 3 B_Workflow_Type_ID NaN
all_Data = pd.merge(left=df,right=DF2, on ='Workflow')
print (all_Data)
id type Workflow Operation Profile Type Name
0 1 A_Workflow_Type_ID 123 1 2 Low_Cost xyz
1 1 B_Workflow_Type_ID 456 2 5 High_Cost z

How to count the no. of same strings in pandas dataframe

i have a dataframe like:
Company Date Country
ABC 2017-09-17 USA
BCD 2017-09-16 USA
ABC 2017-09-17 USA
BCD 2017-09-16 USA
BCD 2017-09-16 USA
I want to get a resultant df as :
Company No: of Days
ABC 2
BCD 3
How do i do it ?

You can use value_counts and rename_axis with reset_index:
df1 = df['Company'].value_counts()
.rename_axis('Company').reset_index(name='No: of Companies')
print (df1)
Company No: of Companies
0 BCD 3
1 ABC 2
Another solution with groupby and aggregating size, last reset_index:
df1 = df.groupby('Company').size().reset_index(name='No: of Companies')
print (df1)
Company No: of Companies
0 BCD 3
1 ABC 2
If need count Date columns:
df1 = df['Date'].value_counts().rename_axis('Date').reset_index(name='No: of Days')
print (df1)
Date No: of Days
0 2017-09-16 3
1 2017-09-17 2
df1 = df.groupby('Date').size().reset_index(name='No: of Days')
print (df1)
Date No: of Days
0 2017-09-16 3
1 2017-09-17 2
EDIT:
If need count pairs Date and Company columns:
df1 = df.groupby(['Date', 'Company']).size().reset_index(name='No: of Days per company')
print (df1)
Date Company No: of Days per company
0 2017-09-16 BCD 3
1 2017-09-17 ABC 2

We Keep Coding

sql objective-c vba vb.net react-native apache vue.js tensorflow api pandas

check sorting by year and quarter pandas dataframe - pandas

I have a df that looks like below date col1 col2 0 2000 Q1 123 456 1 2000 Q2 234 567 2 2000 Q3 345 678 3 2000 Q4 456 789 4 2001 Q1 567 890 The df has over 200 rows. I need to - check if the data is sorted by date if not, then sort it by date Can someone please help me with this? Many thanks

You can convert your date column as pd.Index (or define it as the index of your dataframe): if not pd.Index(df['date']).is_monotonic_increasing: df = df.sort_values('date')

Related

Duplicated IDs pandas

Get the max, min value and the respective owner of that max, min price - Pandas

Convert a tab- and newline-delimited string to pandas dataframe

Merge two DataFrames on multiple columns

How to count the no. of same strings in pandas dataframe

Categories

Resources