How to sort range values in pandas dataframe? - pandas

Below is part of a range column in the DataFrame that I work with. I have tried to sort it using df.sort_values(['column']), but it doesn't work. I would appreciate advice.

You can simplify the solution by converting the values before the hyphen to integers in the key parameter:
f = lambda x: x.str.split('-').str[0].str.replace(',','', regex=True).astype(int)
df = df.sort_values('column', key= f, ignore_index=True)
print (df)
column
0 1,000-1,999
1 2,000-2,949
2 3,000-3,999
3 4,000-4,999
4 5,000-7,499
5 10,000-14,999
6 15,000-19,999
7 20,000-24,999
8 25,000-29,999
9 30,000-39,999
10 40,000-49,999
11 103,000-124,999
12 125,000-149,999
13 150,000-199,999
14 200,000-249,999
15 250,000-299,999
16 300,000-499,999
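Another idea is to sort by the first run of digits (the part before the first comma), which works here because every lower bound is in the thousands:
f = lambda x: x.str.extract(r'(\d+)', expand=False).astype(int)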
df = df.sort_values('column', key= f, ignore_index=True)
If you need to sort by both values separated by -, it is possible with:
f = lambda x: x.str.replace(',','', regex=True).str.extract(r'(\d+)-(\d+)', expand=True).astype(int).apply(tuple, axis=1)
df = df.sort_values('column', key= f, ignore_index=True)
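As a hedged alternative to building tuples inside the key function, the two bounds can be parsed into a helper frame and sorted there. This is only a sketch: the sample values and the names bounds, low, high and result are made up for illustration.

```python
import pandas as pd

# Hypothetical sample with two ranges sharing the same lower bound
df = pd.DataFrame({'column': ['10,000-14,999', '1,000-1,999',
                              '103,000-124,999', '2,000-2,949',
                              '1,000-1,499']})

# Strip the thousands separators, then capture both bounds as integers
bounds = (df['column'].str.replace(',', '', regex=False)
                      .str.extract(r'(\d+)-(\d+)')
                      .astype(int))
bounds.columns = ['low', 'high']

# Sort by lower bound first, then by upper bound, and reorder the original rows
order = bounds.sort_values(['low', 'high']).index
result = df.loc[order].reset_index(drop=True)
print(result)
```

Sorting the numeric helper columns keeps the original strings untouched and breaks ties on the upper bound.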

Related

pandas return auxiliary column from groupby and max

I have a pandas DataFrame with 3 columns, A, B, and V.
I want a DataFrame with A as the index and one column, which contains the B for the maximum V
I can easily create a df with A and the maximum V using groupby, and then perform some machinations to extract the corresponding B, but that seems like the wrong idea.
I've been playing with combinations of groupby and agg with no joy.
Sample Data:
A,B,V
MHQ,Q,0.5192
MMO,Q,0.4461
MTR,Q,0.5385
MVM,Q,0.351
NCR,Q,0.0704
MHQ,E,0.5435
MMO,E,0.4533
MTR,E,-0.6716
MVM,E,0.3684
NCR,E,-0.0278
MHQ,U,0.2712
MMO,U,0.1923
MTR,U,0.3833
MVM,U,0.1355
NCR,U,0.1058
A = [1,1,1,2,2,2,3,3,3,4,4,4]
B = [1,2,3,4,5,6,7,8,9,10,11,12]
V = [21,22,23,24,25,26,27,28,29,30,31,32]
df = pd.DataFrame({'A': A, 'B': B, 'V': V})
res = df.groupby('A').apply(
lambda x: x[x['V']==x['V'].max()]).set_index('A')['B'].to_frame()
res
B
A
1 3
2 6
3 9
4 12
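A possible alternative, sketched here with the same toy data, is to look up the row label of each group's maximum with idxmax and select those rows directly. Note one difference from the apply version above: idxmax returns only the first row when several rows tie for the maximum.

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4],
                   'B': list(range(1, 13)),
                   'V': list(range(21, 33))})

# Row label of the maximum V within each group of A
idx = df.groupby('A')['V'].idxmax()

# Pick those rows and keep B, indexed by A
res = df.loc[idx, ['A', 'B']].set_index('A')
print(res)
```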

pandas add one column to many others

I want to add the values of one column to several other columns.
import pandas as pd
df= pd.DataFrame(data={"a":[1,2],"b":[102,4], "c":[4,5]})
# what I intended to do
df[["a","b"]] = df[["a","b"]] + df[["c"]]
Expected result:
df["a"] = df["a"] + df["c"]
df["b"] = df["b"] + df["c"]
You can assume a list of columns is available (["a", "b"]). Is there a non-loop, non-line-by-line way of doing this? There must be...
Use DataFrame.add with axis=0, and select column c with single brackets [] so that it is a Series:
df[["a","b"]] = df[["a","b"]].add(df["c"], axis=0)
print (df)
a b c
0 5 106 4
1 7 9 5
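To see why the attempt in the question does not work, here is a small sketch (the names misaligned and fixed are only illustrative). Adding two DataFrames aligns them on column labels, and since a, b and c never match, every cell becomes NaN. Adding a Series with axis=0 aligns on the row index instead.

```python
import pandas as pd

df = pd.DataFrame(data={"a": [1, 2], "b": [102, 4], "c": [4, 5]})

# DataFrame + DataFrame: columns a, b on the left and c on the right never match
misaligned = df[["a", "b"]] + df[["c"]]
print(misaligned)

# DataFrame + Series along axis=0: values of c are matched row by row
fixed = df[["a", "b"]].add(df["c"], axis=0)
print(fixed)
```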

Pandas data frame creation using static data

I have a data set like this: {'IT',[1,20,35,44,51,....,1000]}
I want to convert this into a pandas DataFrame and see the output in the format below. How can I achieve this?
Dept Count
IT 1
IT 20
IT 35
IT 44
IT 51
.. .
.. .
.. .
IT 1000
I can write it the way below, but it is not efficient for huge data.
data = [['IT',1],['IT',2],['IT',3]]
df = pd.DataFrame(data,columns=['Dept','Count'])
print(df)
No need for a list comprehension since pandas will automatically fill IT in for every row.
import pandas as pd
d = {'IT':[1,20,35,44,51,1000]}
df = pd.DataFrame({'dept': 'IT', 'count': d['IT']})
Use a list comprehension to build tuples and pass them to the DataFrame constructor:
d = {'IT':[1,20,35,44,51], 'NEW':[1000]}
data = [(k, x) for k, v in d.items() for x in v]
df = pd.DataFrame(data,columns=['Dept','Count'])
print(df)
Dept Count
0 IT 1
1 IT 20
2 IT 35
3 IT 44
4 IT 51
5 NEW 1000
You can use melt:
import pandas as pd
d = {'IT': [10]*100000}
df = pd.DataFrame(d)
df = pd.melt(df, var_name='Dept', value_name='Count')
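One more option, given only as a sketch with made-up sample data, is Series.explode, which also handles lists of different lengths per key. The intermediate name s is illustrative.

```python
import pandas as pd

# Hypothetical input: one list of counts per department
d = {'IT': [1, 20, 35], 'NEW': [1000]}

# A Series of lists, keyed by department, exploded to one row per element
s = pd.Series(d).explode()

# Turn the index into the Dept column and name the values Count
df = s.rename_axis('Dept').reset_index(name='Count')

# explode leaves object dtype, so convert back to integers
df['Count'] = df['Count'].astype(int)
print(df)
```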

Parsing python list of dates into a pandas DataFrame

I need some help/advice on how to wrangle dates into a pandas DataFrame. I have a Python list that looks like this:
['',
'20180715:1700-20180716:1600',
'20180716:1700-20180717:1600',
'20180717:1700-20180718:1600',
'20180718:1700-20180719:1600',
'20180719:1700-20180720:1600',
'20180721:CLOSED',
'20180722:1700-20180723:1600',
'20180723:1700-20180724:1600',
'20180724:1700-20180725:1600',
'20180725:1700-20180726:1600',
'20180726:1700-20180727:1600',
'20180728:CLOSED']
Is there an easy way to transform this into a Pandas DataFrame with two columns (start time and end time)?
Sample:
L = ['',
'20180715:1700-20180716:1600',
'20180716:1700-20180717:1600',
'20180717:1700-20180718:1600',
'20180718:1700-20180719:1600',
'20180719:1700-20180720:1600',
'20180721:CLOSED',
'20180722:1700-20180723:1600',
'20180723:1700-20180724:1600',
'20180724:1700-20180725:1600',
'20180725:1700-20180726:1600',
'20180726:1700-20180727:1600',
'20180728:CLOSED']
I think the best approach here is a list comprehension that splits by the separator and filters out values without it:
df = pd.DataFrame([x.split('-') for x in L if '-' in x], columns=['start','end'])
print (df)
start end
0 20180715:1700 20180716:1600
1 20180716:1700 20180717:1600
2 20180717:1700 20180718:1600
3 20180718:1700 20180719:1600
4 20180719:1700 20180720:1600
5 20180722:1700 20180723:1600
6 20180723:1700 20180724:1600
7 20180724:1700 20180725:1600
8 20180725:1700 20180726:1600
9 20180726:1700 20180727:1600
A pandas solution is also possible, especially if you need to process a Series. Here split and dropna are used:
s = pd.Series(L)
df = s.str.split('-', expand=True).dropna(subset=[1])
df.columns = ['start','end']
print (df)
start end
1 20180715:1700 20180716:1600
2 20180716:1700 20180717:1600
3 20180717:1700 20180718:1600
4 20180718:1700 20180719:1600
5 20180719:1700 20180720:1600
7 20180722:1700 20180723:1600
8 20180723:1700 20180724:1600
9 20180724:1700 20180725:1600
10 20180725:1700 20180726:1600
11 20180726:1700 20180727:1600
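Both outputs above still hold strings. If real timestamps are wanted, a follow-up step could parse them with pd.to_datetime. The format string below is an assumption read off the sample values (date, colon, 24-hour time), and the list is shortened for illustration.

```python
import pandas as pd

# Shortened version of the sample list
L = ['',
     '20180715:1700-20180716:1600',
     '20180721:CLOSED',
     '20180722:1700-20180723:1600']

# Keep only entries that contain the separator and split them into two columns
df = pd.DataFrame([x.split('-') for x in L if '-' in x],
                  columns=['start', 'end'])

# Assumed layout: YYYYMMDD:HHMM
for col in ['start', 'end']:
    df[col] = pd.to_datetime(df[col], format='%Y%m%d:%H%M')
print(df)
```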

How to pick a row in a dataframe and make it the columns name? Pandas

I want to make row 3 the column index.
The quick and easy answer is
df.T.set_index(3).T
I think you need to select the row with loc and then drop this row from df:
df = pd.DataFrame({'A':['Groups'], 'B':['Quantity'], 'C':['Net Sales']}, index=[3])
df.columns = df.loc[3]
df = df.drop(3)
print (df)
Empty DataFrame
Columns: [Groups, Quantity, Net Sales]
Index: []
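The one-row example above ends up empty, so here is a slightly larger sketch with two made-up data rows below the header row. Passing the row through tolist() is an optional extra that avoids carrying the label 3 over as the name of the column index.

```python
import pandas as pd

# Hypothetical frame: row 3 holds the real header, rows 4 and 5 hold data
df = pd.DataFrame({'A': ['Groups', 'x', 'y'],
                   'B': ['Quantity', '6', '7'],
                   'C': ['Net Sales', '4', '5']}, index=[3, 4, 5])

# Promote row 3 to column names, then drop it and renumber the rows
df.columns = df.loc[3].tolist()
df = df.drop(3).reset_index(drop=True)
print(df)
```

All values remain strings after this step, which is the dtype issue that reading the file with the right header avoids.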
But it is better to avoid this, e.g. by using the skiprows parameter if the DataFrame comes from read_csv. The main advantage is that read_csv then gets the right dtypes for all columns:
import pandas as pd
from io import StringIO
temp=u"""A,B,C
D,E,F
G,H,I
J,K,L
Groups,Quantity,Net Sales
4,6,4"""
#after testing, replace 'StringIO(temp)' with 'filename.csv'
df = pd.read_csv(StringIO(temp))
print (df)
A B C
0 D E F
1 G H I
2 J K L
3 Groups Quantity Net Sales
4 4 6 4
df = pd.read_csv(StringIO(temp), skiprows=4)
print (df)
Groups Quantity Net Sales
0 4 6 4
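An equivalent way to express the same idea, shown as a sketch, is the header parameter of read_csv, which names the zero-based line to use for column names and ignores the lines before it. The sample text assumes the header line is comma-separated as Groups,Quantity,Net Sales.

```python
import pandas as pd
from io import StringIO

temp = """A,B,C
D,E,F
G,H,I
J,K,L
Groups,Quantity,Net Sales
4,6,4"""

# Line 4 (zero-based) becomes the header; earlier lines are skipped
df = pd.read_csv(StringIO(temp), header=4)
print(df)
print(df.dtypes)
```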
Timings:
In [319]: %timeit (df.T.set_index(3).T.reset_index(drop=True).astype(float).rename_axis(None, 1))
10 loops, best of 3: 43.1 ms per loop
In [320]: %timeit (jez(df))
10 loops, best of 3: 23.7 ms per loop
In [321]: %timeit (jez1(df))
100 loops, best of 3: 13.6 ms per loop
Code for timings:
Conversion to float is also added to all solutions; if all data are strings, it is not necessary.
import numpy as np
import pandas as pd

np.random.seed(100)
df = pd.DataFrame(np.random.random((100000,3)), columns=list('ABC'))
df = df.drop([0,1,2])
df.loc[3] = ['Groups', 'Quantity', 'Net Sales']
print (df)
print (df.T.set_index(3).T.reset_index(drop=True).astype(float).rename_axis(None, axis=1))
def jez(df):
    df.columns = df.loc[3]
    return df.drop(3).reset_index(drop=True).astype(float).rename_axis(None, axis=1)
def jez1(df):
    arr = df.values
    #get position (number of row) with 3
    idx = df.index.get_loc(3)
    return pd.DataFrame(np.delete(arr, (idx), axis=0).astype(float), columns=arr[idx])