Merge rows with same id, different values in 1 column to multiple columns - pandas

What I have: the number of rows per id can vary, so sometimes one id has 4 rows with different values in the column val, while the other columns all contain the same values.
df1 = pd.DataFrame({'id':[1,1,1,2,2,2,3,3,3], 'val': ['06123','nick','#gmail','06454','abey','#gmail','06888','sisi'], 'media': ['nrc','nrc','nrc','nrc','nrc','nrc','nrc','nrc']})
What I need:
id  kolom 1  kolom 2  kolom 3  media
1   06123    nick     #gmail   nrc
2   06454    abey     #gmail   nrc
3   06888    sisi     None     nrc
I hope this example makes it clear; thanks for the help.

Group the values into a list per id, then pick each list apart by position:
df2 = df1.groupby('id').agg(list)
df2['col 1'] = df2['val'].apply(lambda x: x[0] if len(x) > 0 else None)
df2['col 2'] = df2['val'].apply(lambda x: x[1] if len(x) > 1 else None)
df2['col 3'] = df2['val'].apply(lambda x: x[2] if len(x) > 2 else None)
df2['media'] = df2['media'].apply(lambda x: x[0] if len(x) > 0 else None)
df2 = df2.drop(columns='val')
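A more compact variant of the same idea, working for any number of values per id, numbers the values with cumcount and pivots those positions into columns with unstack. This is only a sketch: it assumes the posted lists are trimmed to equal length (8 entries each), since the frame as posted cannot be constructed (see the next answer).

import pandas as pd

# Lists equalized to 8 entries; the original 9/8/8 mix raises ValueError
df1 = pd.DataFrame({'id': [1, 1, 1, 2, 2, 2, 3, 3],
                    'val': ['06123', 'nick', '#gmail', '06454', 'abey', '#gmail', '06888', 'sisi'],
                    'media': ['nrc'] * 8})

# Number the values within each id (1, 2, 3, ...), then pivot those
# positions into columns; missing positions become NaN automatically
out = df1.set_index(['id', df1.groupby('id').cumcount() + 1])['val'].unstack()
out.columns = ['kolom ' + str(c) for c in out.columns]
out['media'] = df1.groupby('id')['media'].first()
print(out.reset_index())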

Here is another way. Since your original lists don't all have the same length (which will get you a ValueError if you pass them to pd.DataFrame directly), you can define the frame as:
data = {"id":[1,1,1,2,2,2,3,3,3],
"val": ["06123","nick","#gmail","06454","abey","#gmail","06888","sisi"],
"media": ["nrc","nrc","nrc","nrc","nrc","nrc","nrc","nrc"]}
df = pd.DataFrame.from_dict(data, orient="index")
df = df.transpose()
>>> df
id val media
0 1 06123 nrc
1 1 nick nrc
2 1 #gmail nrc
3 2 06454 nrc
4 2 abey nrc
5 2 #gmail nrc
6 3 06888 nrc
7 3 sisi nrc
8 3 NaN NaN
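As a side note, the same padding effect can be had in a single step by wrapping each list in a Series; a small sketch of that variant:

import pandas as pd

data = {"id": [1, 1, 1, 2, 2, 2, 3, 3, 3],
        "val": ["06123", "nick", "#gmail", "06454", "abey", "#gmail", "06888", "sisi"],
        "media": ["nrc"] * 8}

# Each Series keeps its own index, so shorter lists are padded with NaN
df = pd.DataFrame({k: pd.Series(v) for k, v in data.items()})
print(df)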
Afterwards, you can replace the np.nan values with an empty string, so that you can group by your id column and join the values in val separated by a comma.
import numpy as np

df = df.replace(np.nan, "")
df_new = df.groupby(["id"])["val"].apply(lambda x: ",".join(x)).reset_index()
>>> df_new
id val
0 1.0 06123,nick,#gmail
1 2.0 06454,abey,#gmail
2 3.0 06888,sisi,
Then you only need to split the new val column into 3 columns, with any method you want. For example:
new_cols = df_new["val"].str.split(",", expand=True)  # Good ol' split
df_new["kolom 1"] = new_cols[0]  # Assign to new columns
df_new["kolom 2"] = new_cols[1]
df_new["kolom 3"] = new_cols[2]
df_new.drop(columns="val", inplace=True)  # Delete previous val
df_new["media"] = "nrc"  # Add the media column again
df_new = df_new.replace("", np.nan)  # If necessary, replace empty strings with np.nan
>>> df_new
id kolom 1 kolom 2 kolom 3 media
0 1.0 06123 nick #gmail nrc
1 2.0 06454 abey #gmail nrc
2 3.0 06888 sisi NaN nrc
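The three column assignments can also be collapsed into a single join; a sketch of that shortcut, starting from the df_new that still holds the comma-joined val column:

cols = df_new["val"].str.split(",", expand=True)              # same split as above
cols.columns = ["kolom " + str(i + 1) for i in cols.columns]  # 0,1,2 -> kolom 1..3
df_new = df_new.drop(columns="val").join(cols)
df_new["media"] = "nrc"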

Related

How to split a dict list in a dataframe into many columns

I'm using a DataFrame. How do I split a column holding a list of dicts into many columns?
This is for a junior data processor. I've tried many approaches in the past.
import pandas as pd
l = [{'a':1,'b':2},{'a':3,'b':4}]
data = [{'key1':'x','key2':'y','value':l}]
df = pd.DataFrame(data)
data1 = {'key1':['x','x'],'key2':['y','y'],'a':[1,3],'b':[2,4]}
df1 = pd.DataFrame(data1)
df1 is what I need.
Comprehension
d1 = df.drop('value', axis=1)
co = d1.columns
d2 = df.value
pd.DataFrame([
    {**dict(zip(co, tup)), **d}
    for tup, D in zip(zip(*map(d1.get, d1)), d2)
    for d in D
])
   a  b key1 key2
0  1  2    x    y
1  3  4    x    y
Explode
See post on explode
This is a tad different but close
import numpy as np

idx = df.index.repeat(df.value.str.len())  # repeat each row once per dict in its list
val = np.concatenate(df.value).tolist()    # flatten the lists of dicts
d0 = pd.DataFrame(val)                     # one column per dict key
df.drop('value', axis=1).loc[idx].reset_index(drop=True).join(d0)
   a  b key1 key2
0  1  2    x    y
1  3  4    x    y
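On pandas 0.25 or newer, the same reshape can be written with the built-in explode; a small self-contained sketch:

import pandas as pd

l = [{'a': 1, 'b': 2}, {'a': 3, 'b': 4}]
df = pd.DataFrame([{'key1': 'x', 'key2': 'y', 'value': l}])

# explode gives one row per dict; the dicts are then expanded into columns
exploded = df.explode('value').reset_index(drop=True)
result = exploded.drop(columns='value').join(pd.DataFrame(exploded['value'].tolist()))
print(result)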

Pandas create row number - but not as an index

I want to create a row number series - but not override my date index.
I can do it with a loop but I think there must be an easier way?
_cnt = []
for i in range(len(df)):
    _cnt.append(i)
df['row'] = _cnt
Thanks.
Probably the easiest way:
df['row'] = range(len(df))
>>> df
0 1
0 0.444965 0.993382
1 0.001578 0.174628
2 0.663239 0.072992
3 0.664612 0.291361
4 0.486449 0.528354
>>> df['row'] = range(len(df))
>>> df
0 1 row
0 0.444965 0.993382 0
1 0.001578 0.174628 1
2 0.663239 0.072992 2
3 0.664612 0.291361 3
4 0.486449 0.528354 4
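np.arange does the same job and stores the counter as an integer array from the start; a sketch with a small date-indexed frame, since the question specifically wants the date index preserved:

import numpy as np
import pandas as pd

df = pd.DataFrame({'price': [10.0, 10.5, 11.0]},
                  index=pd.date_range('2021-01-04', periods=3))

# Only a new column is added; the date index is untouched
df['row'] = np.arange(len(df))
print(df)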

Swap certain subset of column data

I'm trying to swap a subset of the data in two columns, but all the methods that I have found on SO give a full swap, or also swap the column names. This is what I would like:
df =
   a  b  c
0  1  2  3
1  1  2  3
2  1  2  3
3  1  2  3
Then I create a random mask:
mask = np.random.choice([False, True], len(df), p=[0.5, 0.5])
Applying the mask and the swap, I want the result to look like this if I swap df[mask]['a'] and df[mask]['b']:
df =
   a  b  c
0  1  2  3
1  2  1  3
2  1  2  3
3  2  1  3
What is the best way to achieve this result? I am using pandas 0.18.1
In one line. The .values on the right-hand side strips the column labels, so the data is assigned positionally instead of being realigned back to its original columns:
mask = np.random.choice([False, True], len(df), p=[0.5, 0.5])
df.loc[mask, ['a', 'b']] = df.loc[mask, ['b', 'a']].values
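A quick self-contained check of that swap (the seed is arbitrary, purely so the mask is reproducible):

import numpy as np
import pandas as pd

np.random.seed(0)  # arbitrary seed for a reproducible mask
df = pd.DataFrame({'a': [1] * 4, 'b': [2] * 4, 'c': [3] * 4})
mask = np.random.choice([False, True], len(df), p=[0.5, 0.5])
df.loc[mask, ['a', 'b']] = df.loc[mask, ['b', 'a']].values
print(df)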
Solution with numpy.where (note the argument order: the swapped pair must be picked where the mask is True, the original pair where it is False):
mask = np.random.choice([False, True], len(df), p=[0.5, 0.5])
df[['a', 'b']] = np.where(mask[:, None], df[['b', 'a']], df[['a', 'b']])
print(df)
   a  b  c
0  1  2  3
1  2  1  3
2  2  1  3
3  2  1  3
You can also try this, backing up column a before overwriting it:
import pandas as pd
import numpy as np

df = pd.DataFrame({"a": [1] * 4, "b": [2] * 4})
mask = np.random.choice([False, True], len(df), p=[0.5, 0.5])
df["a_bk"] = df["a"].copy()                     # keep the original a
df["a"] = np.where(mask, df["b"], df["a"])
df["b"] = np.where(mask, df["a_bk"], df["b"])   # use the backup, not the already-swapped a
del df["a_bk"]

Python Pandas: merge, join, concat

I have a dataframe with a non-unique GEO_ID, an attribute column FTYPE (one of 6 values) for each GEO_ID, and an associated length for each FTYPE.
df
  FID                 GEO_ID  FTYPE    Length_km
    0   1400000US06001400100    428  3.291467766
    1   1400000US06001400100    460  7.566487367
    2   1400000US06001401700    460  0.262190266
    3   1400000US06001401700    566  10.49899202
    4   1400000US06001403300    428  0.138171389
    5   1400000US06001403300    558  0.532913513
How do I make 6 new columns for FTYPE (with 1 and 0 to indicate if that row has the FTYPE) and 6 new columns for FTYPE_Length to make each row have a unique GEO_ID?
I want my new dataframe to have a structure like this (with 6 FTYPE-s):
  FID                 GEO_ID  FTYPE_428  FTYPE_428_length  FTYPE_460  FTYPE_460_length
    0   1400000US06001400100          1       3.291467766          1       7.566487367
So far, what I have tried is doing something like this:
import pandas as pd
fname = "filename.csv"
df = pd.read_csv(fname)
nhd = [334, 336, 420, 428, 460, 558, 556]
df1 = df.loc[df['FTYPE']==nhd[0]]
df2 = df.loc[df['FTYPE']==nhd[1]]
df3 = df.loc[df['FTYPE']==nhd[2]]
df4 = df.loc[df['FTYPE']==nhd[3]]
df5 = df.loc[df['FTYPE']==nhd[4]]
df6 = df.loc[df['FTYPE']==nhd[5]]
df7 = df.loc[df['FTYPE']==nhd[6]]
df12 = df1.merge(df2, how='left', left_on='GEO_ID', right_on='GEO_ID')
df23 = df12.merge(df3,how='left', left_on='GEO_ID', right_on='GEO_ID')
df34 = df23.merge(df4,how='left', left_on='GEO_ID', right_on='GEO_ID')
df45 = df34.merge(df5,how='left', left_on='GEO_ID', right_on='GEO_ID')
df56 = df45.merge(df6,how='left', left_on='GEO_ID', right_on='GEO_ID')
df67 = df56.merge(df7,how='left', left_on='GEO_ID', right_on='GEO_ID')
cols = [0,4,7,10,13,16,19]
df67.drop(df67.columns[cols],axis=1,inplace=True)
df67.columns =['GEO_ID','334','len_334','336','len_336','420','len_420','428','len_428','460','len_460','558','len_558','566','len_566']
But this approach is problematic because it reduces the rows to the ones that have the first two FTYPE-s. Is there a way to merge with multiple columns at once?
It's probably easier to write a for loop that goes over each row and uses a condition to fill in the values, like this:
nhd = [334, 336, 420, 428, 460, 558, 556]
for x in nhd:
    df[str(x)] = None
    df["length_" + str(x)] = None
df.head()
for geoid in df["GEO_ID"]:
    # print(geoid)
    for x in nhd:
        # .loc with a row mask actually assigns; chained df.ix[...][...] did not
        df.loc[(df['FTYPE'] == x) & (df['GEO_ID'] == geoid), str(x)] = 1
But this takes too much time, and there is probably a one-liner in Pandas to do the same thing.
Any help on this is appreciated!
Thanks,
Solomon
I don't quite see the point of your _length columns: they seem to carry the same information as whether the matching value is null, which makes them redundant. They're easy enough to create, though.
While we could cram this into one line if we insisted, what's the point? This is SO, not codegolf. So I might do something like:
df = df.pivot(index="GEO_ID", columns="FTYPE", values="Length_km")  # one length column per FTYPE
df.columns = "FTYPE_" + df.columns.astype(str)
has_value = df.notnull().astype(int)   # 1/0 indicator for each FTYPE column
has_value.columns += '_length'
final = pd.concat([df, has_value], axis=1).sort_index(axis='columns')
which gives me (using your input data, which only has 5 distinct FTYPEs):
In [49]: final
Out[49]:
                      FTYPE_334  FTYPE_334_length  FTYPE_428  \
GEO_ID
1400000US06001400100        NaN                 0   3.291468
1400000US06001401700        NaN                 0        NaN
1400000US06001403300        NaN                 0   0.138171
1400000US06001403400    0.04308                 1        NaN

                      FTYPE_428_length  FTYPE_460  FTYPE_460_length  \
GEO_ID
1400000US06001400100                 1   7.566487                 1
1400000US06001401700                 0   0.262190                 1
1400000US06001403300                 1        NaN                 0
1400000US06001403400                 0        NaN                 0

                      FTYPE_558  FTYPE_558_length  FTYPE_566  FTYPE_566_length
GEO_ID
1400000US06001400100        NaN                 0        NaN                 0
1400000US06001401700        NaN                 0  10.498992                 1
1400000US06001403300   0.532914                 1   1.518864                 1
1400000US06001403400        NaN                 0        NaN                 0
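One caveat: if a GEO_ID can repeat the same FTYPE, plain pivot raises on the duplicate index/column pairs. Under that assumption, a sketch of the same approach with pivot_table, which aggregates duplicates instead:

import pandas as pd

df = pd.DataFrame({
    'GEO_ID': ['1400000US06001400100', '1400000US06001400100',
               '1400000US06001401700', '1400000US06001401700',
               '1400000US06001403300', '1400000US06001403300'],
    'FTYPE': [428, 460, 460, 566, 428, 558],
    'Length_km': [3.291467766, 7.566487367, 0.262190266,
                  10.49899202, 0.138171389, 0.532913513],
})

# pivot_table sums the lengths of duplicate (GEO_ID, FTYPE) pairs
lengths = df.pivot_table(index='GEO_ID', columns='FTYPE',
                         values='Length_km', aggfunc='sum')
lengths.columns = 'FTYPE_' + lengths.columns.astype(str)
indicators = lengths.notnull().astype(int)   # 1/0 presence flags
indicators.columns += '_length'
final = pd.concat([lengths, indicators], axis=1).sort_index(axis='columns')
print(final)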

Python 3.4 Pandas DataFrame Structuring

QUESTION
How can I get rid of the repeated column labels for each line of data?
CODE
req = urllib.request.Request(newIsUrl)
resp = urllib.request.urlopen(req)
respData = resp.read()
dRespData = respData.decode('utf-8')
df = pd.DataFrame(columns=['Ticker', 'GW', 'RE', 'OE', 'NI', 'CE'])
df = df.append({'Ticker': ticker,
                'GW': gw,
                'RE': rt,
                'OE': oe,
                'NI': netInc,
                'CE': capExp}, ignore_index=True)
print(df)
yhooKeyStats()
acquireData()
OUTCOME
  Ticker            GW            RE            OE           NI             CE
0    MMM   [7,050,000]  [34,317,000]  [13,109,000]  [4,956,000]  [(1,493,000)]
  Ticker            GW            RE            OE           NI             CE
0    ABT  [17,501,000]   [7,412,000]  [12,156,000]  [2,437,000]
NOTES
all of the headers and data line up respectively
headers are repeated in the dataframe for each line of data
You can skip every other line with a slice and iloc:
In [11]: df = pd.DataFrame({0: ['A', 1, 'A', 3], 1: ['B', 2, 'B', 4]})

In [12]: df
Out[12]:
   0  1
0  A  B
1  1  2
2  A  B
3  3  4

In [13]: df.iloc[1::2]
Out[13]:
   0  1
1  1  2
3  3  4
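The slice cleans up the duplicate header rows after the fact; the root cause, though, is that a fresh one-row frame is printed on every iteration. A sketch of the structural fix, collecting the rows first and building the frame once (scraped_records and the unpacked row variables are hypothetical placeholders for however the values are actually obtained):

import pandas as pd

rows = []
for ticker, gw, rt, oe, netInc, capExp in scraped_records:  # hypothetical iterable
    rows.append({'Ticker': ticker, 'GW': gw, 'RE': rt,
                 'OE': oe, 'NI': netInc, 'CE': capExp})

# One DataFrame, one header, one print
df = pd.DataFrame(rows, columns=['Ticker', 'GW', 'RE', 'OE', 'NI', 'CE'])
print(df)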