String to Columns - pandas

I have a string column in my df.
col
a: 1, b: 2, c: 3
b: 1, c: 3, a: 4
c: 2, b: 4, a: 3
I wish to convert this into multiple columns as:
a b c
1 2 3
4 1 3
3 4 2
Need help regarding this.
I am trying to convert this into a dict and then sort the dict. Post that, I want to maybe do a pivot table. Not exactly sure if it'll do but any help or better method will be appreciated.

Use nested list comprehension with double split by , and : for list of dictionaries and pass to DataFrame constructor:
df = pd.DataFrame([dict(y.split(': ') for y in x.split(', ')) for x in df['col']],
index=df.index)
print (df)
a b c
0 1 2 3
1 4 1 3
2 3 4 2

You can use str.extractall and unstack:
(df['col'].str.extractall('(\w+):\s*([^,]+)')
.set_index(0, append=True).droplevel('match')[1]
.unstack(0)
)
Output:
a b c
0 1 2 3
1 4 1 3
2 3 4 2

Related

Multimatch join in pandas

I am looking for joining two data frame on one column and if there is a multi match then append the results to another column.
NB. using a different example as yours is not reproducible.
You can convert to str.lower, then explode and map the values to groupby.agg again as string:
mapper = df2.set_index('name')['ID'].astype(str)
df1['ID'] = (df1['name']
.str.upper().str.split(',')
.explode()
.map(mapper)
.groupby(level=0).agg(','.join)
)
Or, with a list comprehension:
mapper = df2.set_index('name')['ID'].astype(str)
df1['ID'] = [','.join([mapper[x] for x in s.split(',') if x in mapper])
for s in df1['name']]
output:
name ID
0 A 1
1 b 2
2 A,B 1,2
3 C,a 3,1
4 D 4
Used input:
# df1
name
0 A
1 b
2 A,B
3 C,a
4 D
# df2
name ID
0 A 1
1 B 2
2 C 3
3 D 4

Python: obtaining the first observation according to its date [duplicate]

I have a DataFrame with columns A, B, and C. For each value of A, I would like to select the row with the minimum value in column B.
That is, from this:
df = pd.DataFrame({'A': [1, 1, 1, 2, 2, 2],
'B': [4, 5, 2, 7, 4, 6],
'C': [3, 4, 10, 2, 4, 6]})
A B C
0 1 4 3
1 1 5 4
2 1 2 10
3 2 7 2
4 2 4 4
5 2 6 6
I would like to get:
A B C
0 1 2 10
1 2 4 4
For the moment I am grouping by column A, then creating a value that indicates to me the rows I will keep:
a = data.groupby('A').min()
a['A'] = a.index
to_keep = [str(x[0]) + str(x[1]) for x in a[['A', 'B']].values]
data['id'] = data['A'].astype(str) + data['B'].astype('str')
data[data['id'].isin(to_keep)]
I am sure that there is a much more straightforward way to do this.
I have seen many answers here that use MultiIndex, which I would prefer to avoid.
Thank you for your help.
I feel like you're overthinking this. Just use groupby and idxmin:
df.loc[df.groupby('A').B.idxmin()]
A B C
2 1 2 10
4 2 4 4
df.loc[df.groupby('A').B.idxmin()].reset_index(drop=True)
A B C
0 1 2 10
1 2 4 4
Had a similar situation but with a more complex column heading (e.g. "B val") in which case this is needed:
df.loc[df.groupby('A')['B val'].idxmin()]
The accepted answer (suggesting idxmin) cannot be used with the pipe pattern. A pipe-friendly alternative is to first sort values and then use groupby with DataFrame.head:
data.sort_values('B').groupby('A').apply(DataFrame.head, n=1)
This is possible because by default groupby preserves the order of rows within each group, which is stable and documented behaviour (see pandas.DataFrame.groupby).
This approach has additional benefits:
it can be easily expanded to select n rows with smallest values in specific column
it can break ties by providing another column (as a list) to .sort_values(), e.g.:
data.sort_values(['final_score', 'midterm_score']).groupby('year').apply(DataFrame.head, n=1)
As with other answers, to exactly match the result desired in the question .reset_index(drop=True) is needed, making the final snippet:
df.sort_values('B').groupby('A').apply(DataFrame.head, n=1).reset_index(drop=True)
I found an answer a little bit more wordy, but a lot more efficient:
This is the example dataset:
data = pd.DataFrame({'A': [1,1,1,2,2,2], 'B':[4,5,2,7,4,6], 'C':[3,4,10,2,4,6]})
data
Out:
A B C
0 1 4 3
1 1 5 4
2 1 2 10
3 2 7 2
4 2 4 4
5 2 6 6
First we will get the min values on a Series from a groupby operation:
min_value = data.groupby('A').B.min()
min_value
Out:
A
1 2
2 4
Name: B, dtype: int64
Then, we merge this series result on the original data frame
data = data.merge(min_value, on='A',suffixes=('', '_min'))
data
Out:
A B C B_min
0 1 4 3 2
1 1 5 4 2
2 1 2 10 2
3 2 7 2 4
4 2 4 4 4
5 2 6 6 4
Finally, we get only the lines where B is equal to B_min and drop B_min since we don't need it anymore.
data = data[data.B==data.B_min].drop('B_min', axis=1)
data
Out:
A B C
2 1 2 10
4 2 4 4
I have tested it on very large datasets and this was the only way I could make it work in a reasonable time.
You can sort_values and drop_duplicates:
df.sort_values('B').drop_duplicates('A')
Output:
A B C
2 1 2 10
4 2 4 4
The solution is, as written before ;
df.loc[df.groupby('A')['B'].idxmin()]
If the solution but then if you get an error;
"Passing list-likes to .loc or [] with any missing labels is no longer supported.
The following labels were missing: Float64Index([nan], dtype='float64').
See https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#deprecate-loc-reindex-listlike"
In my case, there were 'NaN' values at column B. So, I used 'dropna()' then it worked.
df.loc[df.groupby('A')['B'].idxmin().dropna()]
You can also boolean indexing the rows where B column is minimal value
out = df[df['B'] == df.groupby('A')['B'].transform('min')]
print(out)
A B C
2 1 2 10
4 2 4 4

python - List of Lists into pandas dataframe including name of columns

I would like to transfer a list of lists into a dataframe with columns based on the lists in the list.
This is still easy.
list = [[....],[....],[...]]
df = pd.DataFrame(list)
df = df.transpose()
The problem is: I would like to give the columns a column-name based on entries I have in another list:
list_two = [A,B,C,...]
This is my issue Im still struggling with.
Is there any approach to solve this problem?
Thanks a lot in advance for your help.
Best regards
Sascha
Use zip with dict for dictionary of lists and pass to DataFrame:
L= [[1,2,3,5],[4,8,9,8],[1,2,5,3]]
list_two = list('ABC')
df = pd.DataFrame(dict(zip(list_two, L)))
print (df)
A B C
0 1 4 1
1 2 8 2
2 3 9 5
3 5 8 3
Or if pass index parameter after transpose get columns names by this list:
df = pd.DataFrame(L, index=list_two).T
print (df)
A B C
0 1 4 1
1 2 8 2
2 3 9 5
3 5 8 3

most efficient way to set dataframe column indexing to other columns

I have a large Dataframe. One of my columns contains the name of others. I want to eval this colum and set in each row the value of the referenced column:
|A|B|C|Column|
|:|:|:|:-----|
|1|3|4| B |
|2|5|3| A |
|3|5|9| C |
Desired output:
|A|B|C|Column|
|:|:|:|:-----|
|1|3|4| 3 |
|2|5|3| 2 |
|3|5|9| 9 |
I am achieving this result using:
df.apply(lambda d: eval("d." + d['Column']), axis=1)
But it is very slow, even using swifter. Is there a more efficient way of performing this?
For better performance, use df.to_numpy():
In [365]: df['Column'] = df.to_numpy()[df.index, df.columns.get_indexer(df.Column)]
In [366]: df
Out[366]:
A B C Column
0 1 3 4 3
1 2 5 3 2
2 3 5 9 9
For Pandas < 1.2.0, use lookup:
df['Column'] = df.lookup(df.index, df['Column'])
From 1.2.0+, lookup is decprecated, you can just use a for loop:
df['Column'] = [df.at[idx, r['Column']] for idx, r in df.iterrows()]
Output:
A B C Column
0 1 3 4 3
1 2 5 3 2
2 3 5 9 9
Since lookup is going to decprecated try numpy method with get_indexer
df['new'] = df.values[df.index,df.columns.get_indexer(df.Column)]
df
Out[75]:
A B C Column new
0 1 3 4 B 3
1 2 5 3 A 2
2 3 5 9 C 9

Append new column to DF after sum?

I have a sample dataframe below:
sn C1-1 C1-2 C1-3 H2-1 H2-2 K3-1 K3-2
1 4 3 5 4 1 4 2
2 2 2 0 2 0 1 2
3 1 2 0 0 2 1 2
I will like to sum based on the prefix of C1, H2, K3 and output three new columns with the total sum. The final result is this:
sn total_c1 total_h2 total_k3
1 12 5 6
2 4 2 3
3 3 2 3
What I have tried on my original df:
lst = ["C1", "H2", "K3"]
lst2 = ["total_c1", "total_h2", "total_k3"]
for k in lst:
idx = df.columns.str.startswith(i)
for j in lst2:
df[j] = df.iloc[:,idx].sum(axis=1)
df1 = df.append(df, sort=False)
But I kept getting error
IndexError: Item wrong length 35 instead of 36.
I can't figure out how to append the new total column to produce my end result in the loop.
Any help will be appreciated (or better suggestion as oppose to loop). Thank you.
You can use groupby:
# columns of interest
cols = df.columns[1:]
col_groups = cols.str.split('-').str[0]
out_df = df[['sn']].join(df[cols].groupby(col_groups, axis=1)
.sum()
.add_prefix('total_')
)
Output:
sn total_C1 total_H2 total_K3
0 1 12 5 6
1 2 4 2 3
2 3 3 2 3
Let us try ,split then groupby with it with axis=1
out = df.groupby(df.columns.str.split('-').str[0],axis=1).sum().set_index('sn').add_prefix('Total_').reset_index()
Out[84]:
sn Total_C1 Total_H2 Total_K3
0 1 12 5 6
1 2 4 2 3
2 3 3 2 3
Another option, where we create a dictionary to groupby the columns:
mapping = {entry: f"total_{entry[:2]}" for entry in df.columns[1:]}
result = df.groupby(mapping, axis=1).sum()
result.insert(0, "sn", df.sn)
result
sn total_C1 total_H2 total_K3
0 1 12 5 6
1 2 4 2 3
2 3 3 2 3