Parsing a Python list of dates into a pandas DataFrame

I need some help/advice on how to wrangle dates into a pandas DataFrame. I have a Python list that looks like this:
['',
'20180715:1700-20180716:1600',
'20180716:1700-20180717:1600',
'20180717:1700-20180718:1600',
'20180718:1700-20180719:1600',
'20180719:1700-20180720:1600',
'20180721:CLOSED',
'20180722:1700-20180723:1600',
'20180723:1700-20180724:1600',
'20180724:1700-20180725:1600',
'20180725:1700-20180726:1600',
'20180726:1700-20180727:1600',
'20180728:CLOSED']
Is there an easy way to transform this into a Pandas DataFrame with two columns (start time and end time)?

Sample:
L = ['',
'20180715:1700-20180716:1600',
'20180716:1700-20180717:1600',
'20180717:1700-20180718:1600',
'20180718:1700-20180719:1600',
'20180719:1700-20180720:1600',
'20180721:CLOSED',
'20180722:1700-20180723:1600',
'20180723:1700-20180724:1600',
'20180724:1700-20180725:1600',
'20180725:1700-20180726:1600',
'20180726:1700-20180727:1600',
'20180728:CLOSED']
I think the best approach here is a list comprehension that splits on the separator and filters out values without it:
df = pd.DataFrame([x.split('-') for x in L if '-' in x], columns=['start','end'])
print (df)
start end
0 20180715:1700 20180716:1600
1 20180716:1700 20180717:1600
2 20180717:1700 20180718:1600
3 20180718:1700 20180719:1600
4 20180719:1700 20180720:1600
5 20180722:1700 20180723:1600
6 20180723:1700 20180724:1600
7 20180724:1700 20180725:1600
8 20180725:1700 20180726:1600
9 20180726:1700 20180727:1600
A pandas solution is also possible, especially if you need to process a Series - here split and dropna are used:
s = pd.Series(L)
df = s.str.split('-', expand=True).dropna(subset=[1])
df.columns = ['start','end']
print (df)
start end
1 20180715:1700 20180716:1600
2 20180716:1700 20180717:1600
3 20180717:1700 20180718:1600
4 20180718:1700 20180719:1600
5 20180719:1700 20180720:1600
7 20180722:1700 20180723:1600
8 20180723:1700 20180724:1600
9 20180724:1700 20180725:1600
10 20180725:1700 20180726:1600
11 20180726:1700 20180727:1600
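As a possible follow-up (not part of the original answers), both string columns can be parsed into real datetimes with pd.to_datetime and an explicit format; the list here is an assumed, shortened version of the sample above:

```python
import pandas as pd

L = ['', '20180715:1700-20180716:1600', '20180721:CLOSED',
     '20180722:1700-20180723:1600']

df = pd.DataFrame([x.split('-') for x in L if '-' in x], columns=['start', 'end'])

# Parse both columns with an explicit format string matching YYYYMMDD:HHMM
for col in ['start', 'end']:
    df[col] = pd.to_datetime(df[col], format='%Y%m%d:%H%M')

print(df.dtypes)  # both columns are datetime64[ns]
```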

Related

GroupBy-Apply even for empty DataFrame

I am using groupby-apply to create a new DataFrame from a given DataFrame. But if the given DataFrame is empty, the result looks like the given DataFrame with group keys, not like the target new DataFrame. So to get the shape of the target DataFrame I have to use if..else with a length check and, if the given DataFrame is empty, manually create a DataFrame with the specified columns and indexes.
This kind of breaks the flow of the code. Also, if the structure of the target DataFrame changes in the future, I would have to fix the code in two places instead of one.
Is there a way to get the shape of the target DataFrame even if the given DataFrame is empty, with GroupBy only (or without if..else)?
Simplified example:
def some_func(df: pd.DataFrame):
    return df.values.sum() + pd.DataFrame([[1, 1, 1], [2, 2, 2], [3, 3, 3]],
                                          columns=['new_col1', 'new_col2', 'new_col3'])

df1 = pd.DataFrame([[1, 1], [1, 2], [2, 1], [2, 2]], columns=['col1', 'col2'])
df2 = pd.DataFrame(columns=['col1', 'col2'])

df1_grouped = df1.groupby(['col1'], group_keys=False).apply(lambda df: some_func(df))
df2_grouped = df2.groupby(['col1'], group_keys=False).apply(lambda df: some_func(df))
Result for df1 is ok:
new_col1 new_col2 new_col3
0 6 6 6
1 7 7 7
2 8 8 8
0 8 8 8
1 9 9 9
2 10 10 10
And not ok for df2:
Empty DataFrame
Columns: [col1, col2]
Index: []
If..else to get expected result for df2:
df = df2
if df.empty:
    df_grouped = pd.DataFrame(columns=['new_col1', 'new_col2', 'new_col3'])
else:
    df_grouped = df.groupby(['col1'], group_keys=False).apply(lambda df: some_func(df))
Gives what I need:
Empty DataFrame
Columns: [new_col1, new_col2, new_col3]
Index: []
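One possible workaround (a sketch, not a confirmed answer from the thread) is to iterate over the groups manually and, when there are none, derive the empty result from some_func itself by calling it on a zero-row slice. The hypothetical grouped_apply helper below keeps the target structure defined in a single place:

```python
import pandas as pd

def some_func(df: pd.DataFrame):
    return df.values.sum() + pd.DataFrame(
        [[1, 1, 1], [2, 2, 2], [3, 3, 3]],
        columns=['new_col1', 'new_col2', 'new_col3'])

def grouped_apply(df: pd.DataFrame) -> pd.DataFrame:
    parts = [some_func(g) for _, g in df.groupby('col1')]
    if parts:
        return pd.concat(parts)
    # No groups: let some_func define the columns, then drop all rows
    return some_func(df.iloc[:0]).iloc[:0]

df2 = pd.DataFrame(columns=['col1', 'col2'])
print(grouped_apply(df2))  # empty frame with columns new_col1..new_col3
```

The if..else is still there, but it lives inside one helper, so a future change to the target structure only touches some_func.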

Multiplying two data frames in pandas

I have two data frames, df1 and df2, as shown below. I want to create a third dataframe df as shown below. What would be the appropriate way?
df1={'id':['a','b','c'],
'val':[1,2,3]}
df1=pd.DataFrame(df1)
df1
id val
0 a 1
1 b 2
2 c 3
df2={'yr':['2010','2011','2012'],
'val':[4,5,6]}
df2=pd.DataFrame(df2)
df2
yr val
0 2010 4
1 2011 5
2 2012 6
df={'id':['a','b','c'],
'val':[1,2,3],
'2010':[4,8,12],
'2011':[5,10,15],
'2012':[6,12,18]}
df=pd.DataFrame(df)
df
id val 2010 2011 2012
0 a 1 4 5 6
1 b 2 8 10 12
2 c 3 12 15 18
I can basically convert df1 and df2 to 1-by-n matrices, get an n-by-n result, and assign it back to df1. But is there an easy pandas way?
TL;DR
We can do it in one line like this:
df1.join(df1.val.apply(lambda x: x * df2.set_index('yr').val))
or like this:
df1.join(df1.set_index('id') @ df2.set_index('yr').T, on='id')
Done.
The long story
Let's see what's going on here.
To find the output of multiplication of each df1.val by values in df2.val we use apply:
df1['val'].apply(lambda x: x * df2.val)
The function inside receives the df1.val values one by one and multiplies each by df2.val element-wise (see broadcasting for details if needed). Since df2.val is a pandas Series, the output is a data frame with index df1.val.index and columns df2.val.index. With df2.set_index('yr') we make the years the index before multiplication, so they become the column names in the output.
DataFrame.join is joining frames index-on-index by default. So due to identical indexes of df1 and the multiplication output, we can apply df1.join( <the output of multiplication> ) as is.
At the end we get the desired matrix with indexes df1.index and columns id, val, *df2['yr'].
The second variant with the @ operator is actually the same. The main difference is that we multiply two-dimensional frames instead of series. These are vertical and horizontal vectors, respectively. So the matrix multiplication produces a frame with index df1.id and columns df2.yr, and the element-wise products as values. At the end we join df1 with the output on the identical id column and index, respectively.
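Putting the @ variant together as a runnable sketch with the sample frames from the question:

```python
import pandas as pd

df1 = pd.DataFrame({'id': ['a', 'b', 'c'], 'val': [1, 2, 3]})
df2 = pd.DataFrame({'yr': ['2010', '2011', '2012'], 'val': [4, 5, 6]})

# (3 x 1) @ (1 x 3) -> outer product with index df1.id and columns df2.yr
outer = df1.set_index('id') @ df2.set_index('yr').T
df = df1.join(outer, on='id')
print(df)
```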
This works for me:
import numpy as np

df2 = df2.T
new_df = pd.DataFrame(np.outer(df1['val'],df2.iloc[1:]))
df = pd.concat([df1, new_df], axis=1)
df.columns = ['id', 'val', '2010', '2011', '2012']
df
The output I get:
id val 2010 2011 2012
0 a 1 4 5 6
1 b 2 8 10 12
2 c 3 12 15 18
Your question is a bit vague. But I suppose you want to do something like this:
df = pd.concat([df1, df2], axis=1)

How to sort range values in pandas dataframe?

Below is part of a range column in a dataframe that I work with. I have tried to sort it using df.sort_values(['column']) but it doesn't work. I would appreciate advice.
You can sort by the value before the - by converting it to an integer in the key parameter:
f = lambda x: x.str.split('-').str[0].str.replace(',','', regex=True).astype(int)
df = df.sort_values('column', key=f, ignore_index=True)
print (df)
column
0 1,000-1,999
1 2,000-2,949
2 3,000-3,999
3 4,000-4,999
4 5,000-7,499
5 10,000-14,999
6 15,000-19,999
7 20,000-24,999
8 25,000-29,999
9 30,000-39,999
10 40,000-49,999
11 103,000-124,999
12 125,000-149,999
13 150,000-199,999
14 200,000-249,999
15 250,000-299,999
16 300,000-499,999
Another idea is to use the first integer of each value for sorting:
f = lambda x: x.str.extract(r'(\d+)', expand=False).astype(int)
df = df.sort_values('column', key=f, ignore_index=True)
If you need to sort by both values split by -, it is possible with:
f = lambda x: x.str.replace(',','', regex=True).str.extract(r'(\d+)-(\d+)', expand=True).astype(int).apply(tuple, axis=1)
df = df.sort_values('column', key=f, ignore_index=True)
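A minimal runnable sketch of the first approach, using an assumed subset of the ranges shown above (sort_values with key requires pandas 1.1+):

```python
import pandas as pd

df = pd.DataFrame({'column': ['10,000-14,999', '1,000-1,999',
                              '103,000-124,999', '2,000-2,949']})

# Sort by the integer before the '-', stripping the thousands separators
f = lambda x: x.str.split('-').str[0].str.replace(',', '', regex=False).astype(int)
df = df.sort_values('column', key=f, ignore_index=True)
print(df)
```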

pandas add one column to many others

I want to add the values of one column to several other columns.
import pandas as pd
df= pd.DataFrame(data={"a":[1,2],"b":[102,4], "c":[4,5]})
# what I intended to do
df[["a","b"]] = df[["a","b"]] + df[["c"]]
Expected result:
df["a"] = df["a"] + df["c"]
df["b"] = df["b"] + df["c"]
You can assume a list of columns is available (["a", "b"]). Is there a non-loop / non-line-by-line way of doing this? There must be...
Use DataFrame.add with axis=0 and select column c with a single [] so it is a Series:
df[["a","b"]] = df[["a","b"]].add(df["c"], axis=0)
print (df)
a b c
0 5 106 4
1 7 9 5
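Since the question says a list of columns is available, the same call can be sketched with a variable holding that list:

```python
import pandas as pd

df = pd.DataFrame(data={"a": [1, 2], "b": [102, 4], "c": [4, 5]})
cols = ["a", "b"]  # any list of target columns

# Add the Series df["c"] to every column in cols, aligned row-wise
df[cols] = df[cols].add(df["c"], axis=0)
print(df)
```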

Create new nested column within dataframe

I have the following
df1 = pd.DataFrame({'data': [1,2,3]})
df2 = pd.DataFrame({'data': [4,5,6]})
df = pd.concat([df1,df2], keys=['hello','world'], axis=1)
What is the "proper" way of creating a new nested column (say, df['world']['data']*2) within the hello column? I have tried df['hello']['new_col'] = df['world']['data']*2 but this does not seem to work.
Use tuples to select and to set MultiIndex columns:
df[('hello','new_col')] = df[('world','data')]*2
print (df)
hello world hello
data data new_col
0 1 4 8
1 2 5 10
2 3 6 12
Selecting like df['world']['data'] is not recommended because of possible chained indexing (see the pandas documentation on returning a view versus a copy).
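As a small follow-up sketch (not part of the original answer): the assigned column is appended at the end of the MultiIndex, so sort_index(axis=1) can be used to group it back under hello:

```python
import pandas as pd

df1 = pd.DataFrame({'data': [1, 2, 3]})
df2 = pd.DataFrame({'data': [4, 5, 6]})
df = pd.concat([df1, df2], keys=['hello', 'world'], axis=1)

df[('hello', 'new_col')] = df[('world', 'data')] * 2

# Sort the columns lexicographically so each top level is contiguous
df = df.sort_index(axis=1)
print(df)
```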