How to split a dict in a dataframe column into many columns - pandas

I'm using a pandas DataFrame. How do I split a column holding a list of dicts into many columns? This is for a junior data processor. In the past, I've tried many approaches.
import pandas as pd
l = [{'a':1,'b':2},{'a':3,'b':4}]
data = [{'key1':'x','key2':'y','value':l}]
df = pd.DataFrame(data)
data1 = {'key1':['x','x'],'key2':['y','y'],'a':[1,3],'b':[2,4]}
df1 = pd.DataFrame(data1)
df1 is what I need.

Comprehension
d1 = df.drop('value', axis=1)   # the scalar columns
co = d1.columns
d2 = df.value                   # the column of dict lists
pd.DataFrame([
    {**dict(zip(co, tup)), **d}
    for tup, D in zip(zip(*map(d1.get, d1)), d2)
    for d in D
])

   a  b key1 key2
0  1  2    x    y
1  3  4    x    y
Explode
See post on explode
This is a tad different but close
import numpy as np

idx = df.index.repeat(df.value.str.len())   # repeat each row once per dict in its list
val = np.concatenate(df.value).tolist()     # flatten the lists of dicts
d0 = pd.DataFrame(val)
df.drop('value', axis=1).loc[idx].reset_index(drop=True).join(d0)

   a  b key1 key2
0  1  2    x    y
1  3  4    x    y
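Not part of the original answers, but with a modern pandas (0.25+ for DataFrame.explode, 1.0+ for pd.json_normalize) the same reshape can be written as a short pipeline; a minimal sketch:

import pandas as pd

l = [{'a': 1, 'b': 2}, {'a': 3, 'b': 4}]
df = pd.DataFrame([{'key1': 'x', 'key2': 'y', 'value': l}])

exploded = df.explode('value').reset_index(drop=True)  # one row per dict
out = exploded.drop(columns='value').join(
    pd.json_normalize(exploded['value'].tolist())      # dicts -> columns a, b
)
print(out)
#   key1 key2  a  b
# 0    x    y  1  2
# 1    x    y  3  4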


Why doesn't this iteration correctly change the global DataFrame variables?

I have code similar to the example below at my job, and I don't know why it doesn't correctly change the global DataFrame variables when they are modified through a nested list in a for loop.
df1 = pd.DataFrame({
    'x': [1, 2, 3, 4, 5],
    'y': ['a', 'b', 'c', 'd', 'e']
})
df2 = df1
for array in [[df1, 9], [df2, 'z']]:
    array[0]['x'] = array[1]
    array[0]['y'] = array[1]
    print(array[0])

   x  y
0  9  9
1  9  9
2  9  9
3  9  9
4  9  9
   x  y
0  z  z
1  z  z
2  z  z
3  z  z
4  z  z

print(df1)

   x  y
0  z  z
1  z  z
2  z  z
3  z  z
4  z  z

print(df2)

   x  y
0  z  z
1  z  z
2  z  z
3  z  z
4  z  z
So during the loop we see the expected changes: df1 with 9 in both columns, then df2 with z in both columns.
But when we check the global variables afterwards, everything is z, even df1, and I don't know why.
When an object in Python is mutable, assignment copies a reference, not the value. For example, int and str are immutable object types, while list, dict and pandas.DataFrame are mutable, so two names can end up pointing at one shared object. See the example below for what this means for int versus list:
a = 1
b = a
b += 1       # rebinds b to a new int; a is untouched
print(a)
# >> 1

x = [1, 2, 3]
y = x
y.append(4)  # mutates the one shared list
print(x)
# >> [1, 2, 3, 4]
So, when you assigned df2, you bound it to the exact same object that df1 was referring to. That means that when you change df2, you also change the object referred to by df1, because it is physically the same object; the second loop iteration therefore overwrites the 9s with z's on that one shared frame. You can check this by using the built-in id() function:
df1 = pd.DataFrame({'x': [1,2,3,4,5], 'y': ['a', 'b', 'c', 'd', 'e']})
df2 = df1
print(id(df1), id(df2))
# >> 4695746416 4695746416
To have a new copy of the same dataframe, you need to use copy():
df1 = pd.DataFrame({'x': [1,2,3,4,5], 'y': ['a', 'b', 'c', 'd', 'e']})
df2 = df1.copy()
print(id(df1), id(df2))
# >> 4695749728 4695742816
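With .copy() in place, the loop from the question behaves as intended; a quick sketch restating it (the loop-variable names here are illustrative):

import pandas as pd

df1 = pd.DataFrame({'x': [1, 2, 3, 4, 5],
                    'y': ['a', 'b', 'c', 'd', 'e']})
df2 = df1.copy()   # an independent object, not a second name for df1

for frame, value in [(df1, 9), (df2, 'z')]:
    frame['x'] = value
    frame['y'] = value

print(df1['x'].unique(), df2['x'].unique())
# >> [9] ['z']   -- df1 keeps its 9s once df2 is a real copy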

Merge rows with the same id and different values in one column into multiple columns

What I have: the number of rows per id can vary, so sometimes one id has four rows with different values in the column val; the other columns all carry the same values.
df1 = pd.DataFrame({'id': [1, 1, 1, 2, 2, 2, 3, 3, 3],
                    'val': ['06123', 'nick', '#gmail', '06454', 'abey', '#gmail', '06888', 'sisi'],
                    'media': ['nrc', 'nrc', 'nrc', 'nrc', 'nrc', 'nrc', 'nrc', 'nrc']})
What I need:
id  kolom 1  kolom 2  kolom 3  media
1   06123    nick     #gmail   nrc
2   06454    abey     #gmail   nrc
3   06888    sisi     None     nrc
I hope I gave a good example; thanks for the help.
df2 = df1.groupby('id').agg(list)   # one row per id, values collected into lists
df2['col 1'] = df2['val'].apply(lambda x: x[0] if len(x) > 0 else 'None')
df2['col 2'] = df2['val'].apply(lambda x: x[1] if len(x) > 1 else 'None')
df2['col 3'] = df2['val'].apply(lambda x: x[2] if len(x) > 2 else 'None')
df2['media'] = df2['media'].apply(lambda x: x[0] if len(x) > 0 else 'None')
df2 = df2.drop(columns='val')       # drop returns a new frame, so assign it back
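If the number of values per id isn't fixed at three, a sketch that builds however many columns the longest list needs (the names grouped and wide are illustrative; it assumes df1 was constructed successfully, e.g. with padded lists as in the next answer):

grouped = df1.groupby('id').agg(list)
wide = pd.DataFrame(grouped['val'].tolist(), index=grouped.index)  # short lists padded with None
wide.columns = [f'col {i + 1}' for i in wide.columns]
wide['media'] = grouped['media'].str[0]   # first element of each media list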
Here is another way. Since the lists in your original dictionary don't all have the same length (which will get you a ValueError when passed to pd.DataFrame directly), you can build the frame like this:
data = {"id":[1,1,1,2,2,2,3,3,3],
"val": ["06123","nick","#gmail","06454","abey","#gmail","06888","sisi"],
"media": ["nrc","nrc","nrc","nrc","nrc","nrc","nrc","nrc"]}
df = pd.DataFrame.from_dict(data, orient="index")
df = df.transpose()
>>> df
    id     val media
0    1   06123   nrc
1    1    nick   nrc
2    1  #gmail   nrc
3    2   06454   nrc
4    2    abey   nrc
5    2  #gmail   nrc
6    3   06888   nrc
7    3    sisi   nrc
8    3     NaN   NaN
Afterwards, you can replace the np.nan values with an empty string, so that you can group by your id column and join the values in val separated by a ,.
import numpy as np

df = df.replace(np.nan, "", regex=True)
df_new = df.groupby(["id"])["val"].apply(lambda x: ",".join(x)).reset_index()
>>> df_new
    id                val
0  1.0  06123,nick,#gmail
1  2.0  06454,abey,#gmail
2  3.0        06888,sisi,
Then, you only need to transform the new val column into 3 columns by splitting the string inside, with any method you want. For example,
new_cols = df_new["val"].str.split(",", expand=True)  # good ol' split
df_new["kolom 1"] = new_cols[0]                       # assign to new columns
df_new["kolom 2"] = new_cols[1]
df_new["kolom 3"] = new_cols[2]
df_new.drop(columns="val", inplace=True)              # positional axis arguments were removed in pandas 2.0
df_new["media"] = "nrc"                               # add the media column again
df_new = df_new.replace("", np.nan, regex=True)       # if necessary, replace empty string with np.nan
>>> df_new
    id kolom 1 kolom 2 kolom 3 media
0  1.0   06123    nick  #gmail   nrc
1  2.0   06454    abey  #gmail   nrc
2  3.0   06888    sisi     NaN   nrc
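For completeness, a more compact route to the same table (a sketch, not from the original answer; it assumes pandas >= 1.0 and the padded df from above, i.e. the version that still contains NaN): number the values within each id with cumcount, then pivot them into columns.

out = (df.dropna(subset=["val"])
         .assign(kolom=lambda d: "kolom " + (d.groupby("id").cumcount() + 1).astype(str))
         .pivot(index="id", columns="kolom", values="val")
         .reset_index())
out["media"] = "nrc"   # media is constant in this example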

Apply a function on two dataframe rows

Given a pandas dataframe like this:
df = pd.DataFrame({'col1': [1, 2, 3], 'col2': [4, 5, 6]})
   col1  col2
0     1     4
1     2     5
2     3     6
I would like to do something equivalent to the loop below using a function, but without passing the whole dataframe "by value" or as a global variable (it could be huge, and then it would give me a memory error):
i = -1
for index, row in df.iterrows():
    if i < 0:
        i = index
        continue
    c1 = df.iloc[i, 0] + df.iloc[index, 0]
    c2 = df.iloc[i, 1] + df.iloc[index, 1]
    df.iloc[index, 0] = c1   # the original used df.ix, which was removed from pandas
    df.iloc[index, 1] = c2
    i = index

   col1  col2
0     1     4
1     3     9
2     6    15
i.e., I would like to have a function which will give me the previous output:
def my_function(two_rows):
    row1 = two_rows[0]
    row2 = two_rows[1]
    c1 = row1[0] + row2[0]
    c2 = row1[1] + row2[1]
    row2[0] = c1
    row2[1] = c2
    return row2

df.apply(my_function, axis=1)
df

   col1  col2
0     1     4
1     3     9
2     6    15
Is there a way of doing this?
What you've demonstrated is a cumsum
df.cumsum()

   col1  col2
0     1     4
1     3     9
2     6    15
To define a function as a loop that does this in place:
Slow, cell by cell:
def f(df):
    n = len(df)
    r = range(1, n)
    for j in df.columns:
        for i in r:
            df[j].values[i] += df[j].values[i - 1]
    return df

f(df)

   col1  col2
0     1     4
1     3     9
2     6    15
Compromise between memory and efficiency
def f(df):
    for j in df.columns:
        df[j].values[:] = df[j].values.cumsum()
    return df

f(df)

   col1  col2
0     1     4
1     3     9
2     6    15
Note that you don't need to return df. I chose to for convenience.
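The same idea also fits in a single NumPy call; a sketch that assigns back through df[:] rather than writing into .values, since pandas does not guarantee that .values is a writable view:

import numpy as np
import pandas as pd

df = pd.DataFrame({'col1': [1, 2, 3], 'col2': [4, 5, 6]})
df[:] = np.cumsum(df.to_numpy(), axis=0)   # overwrite every row with the running sum
print(df)
#    col1  col2
# 0     1     4
# 1     3     9
# 2     6    15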

Swap certain subset of column data

I'm trying to swap a subset of the data in two columns, but all the methods I have found on SO do a full swap or swap the column names as well. This is what I would like:
df =
   a  b  c
0  1  2  3
1  1  2  3
2  1  2  3
3  1  2  3
Then I create a random mask:
mask = np.random.choice([False, True], len(df), p=[0.5, 0.5])
Applying the mask and the swap, I want the result to look like this if I swap df[mask]['a'] and df[mask]['b']:
df =
   a  b  c
0  1  2  3
1  2  1  3
2  1  2  3
3  2  1  3
What is the best way to achieve this result? I am using pandas 0.18.1
In one line (the .values on the right-hand side is what stops pandas from aligning the swapped columns back by name):
mask = np.random.choice([False, True], len(df), p=[0.5, 0.5])
df.loc[mask, ['a', 'b']] = df.loc[mask, ['b', 'a']].values
Solution with numpy.where:
mask = np.random.choice([False, True], len(df), p=[0.5, 0.5])
df[['b', 'a']] = np.where(mask[:, None], df[['b', 'a']], df[['a', 'b']])
print(df)
   a  b  c
0  1  2  3
1  2  1  3
2  2  1  3
3  2  1  3
You can try this (note that the backup column must be used in the second np.where; otherwise b would be filled from the already-overwritten a):
import pandas as pd
import numpy as np

df = pd.DataFrame({"a": [1] * 4, "b": [2] * 4})
mask = np.random.choice([False, True], len(df), p=[0.5, 0.5])
df["a_bk"] = df["a"].copy()                     # keep the original 'a' values
df["a"] = np.where(mask, df["b"], df["a"])
df["b"] = np.where(mask, df["a_bk"], df["b"])   # swap from the backup, not the new 'a'
del df["a_bk"]
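A quick deterministic check of the corrected version (a sketch with a fixed mask instead of a random one):

df = pd.DataFrame({"a": [1] * 4, "b": [2] * 4})
mask = np.array([False, True, False, True])
df["a_bk"] = df["a"].copy()
df["a"] = np.where(mask, df["b"], df["a"])
df["b"] = np.where(mask, df["a_bk"], df["b"])
del df["a_bk"]
print(df)
#    a  b
# 0  1  2
# 1  2  1
# 2  1  2
# 3  2  1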

How to build a column by column dataframe in pandas

I have a dataframe looking like this example:
A   | B | C
____|___|____
s   | s | nan
nan | x | x
I would like to create a table of intersections between columns like this:
  | A     | B    | C
__|_______|______|______
A | True  | True | False
__|_______|______|______
B | True  | True | True
__|_______|______|______
C | False | True | True
__|_______|______|______
Is there an elegant cycle-free way to do it?
Thank you!
Setup
import numpy as np
import pandas as pd

df = pd.DataFrame(dict(A=['s', np.nan], B=['s', 'x'], C=[np.nan, 'x']))
Option 1
You can use numpy broadcasting to compare each column with each other column, then determine whether any of the row-wise comparisons are True.
v = df.values
pd.DataFrame(
    (v[:, :, None] == v[:, None]).any(0),
    df.columns, df.columns
)

       A     B      C
A   True  True  False
B   True  True   True
C  False  True   True
By replacing any with sum you can get a count of how many intersections there are.
v = df.values
pd.DataFrame(
    (v[:, :, None] == v[:, None]).sum(0),
    df.columns, df.columns
)

   A  B  C
A  1  1  0
B  1  2  1
C  0  1  1
Or use np.count_nonzero instead of sum
v = df.values
pd.DataFrame(
    np.count_nonzero(v[:, :, None] == v[:, None], 0),
    df.columns, df.columns
)

   A  B  C
A  1  1  0
B  1  2  1
C  0  1  1
Option 2
Fun & Creative way
d = pd.get_dummies(df.stack()).unstack(fill_value=0)  # one dummy column per (value, column) pair
d = d.T.dot(d)                                        # co-occurrence counts
d.groupby(level=1).sum().groupby(level=1, axis=1).sum()

   A  B  C
A  1  1  0
B  1  2  1
C  0  1  1
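One caveat: groupby(..., axis=1) is deprecated in recent pandas. An equivalent sketch that transposes instead of grouping along columns (same counts, assuming pandas >= 2.0):

d = pd.get_dummies(df.stack()).unstack(fill_value=0)
d = d.T.dot(d)                                        # co-occurrence counts, as above
print(d.groupby(level=1).sum().T.groupby(level=1).sum().T)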