create a DataFrame from a list of unequal-length lists - pandas

I am trying to convert a list like this:
l = [[1, 2, 3, 17], [4, 19], [5]]
into a DataFrame with each number as the index and the position of its containing list as the value.
For example, 19 is in the second list, so I expect to get somewhere one row with "19" as index and "1" as value, and so on.
I managed to get it (cf. the boilerplate below), but I guess there is something simpler:
>>> df = pd.DataFrame(l)
>>> df = df.unstack().reset_index(level=0, drop=True)
>>> df = df[df.notnull()]  # remove NaN rows
>>> df = pd.DataFrame(df)
>>> df = df.reset_index().set_index(0)
>>> print(df)
    index
0
1       0
4       1
5       2
2       0
19      1
3       0
17      0
Thanks in advance.

In [52]: pd.DataFrame([(item, i) for i, seq in enumerate(l)
                       for item in seq]).set_index(0)
Out[52]:
     1
0
1    0
2    0
3    0
17   0
4    1
19   1
5    2
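
If you prefer named columns over the bare 0/1 headers, the same comprehension can take labels (a small variant of the answer above; the names 'value' and 'list_pos' are just illustrative, not from the original):
pd.DataFrame([(item, i) for i, seq in enumerate(l) for item in seq],
             columns=['value', 'list_pos']).set_index('value')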


How to convert the pandas series to a dataframe that is shown in the attached figure?

Here I want to convert the square-bracket list values to a DataFrame, but I cannot manage to convert them. Please suggest a solution to this problem.
Assuming this input Series:
series = pd.Series([[1,2,3],[4,5,6]])
0 [1, 2, 3]
1 [4, 5, 6]
dtype: object
To convert the list to rows, use pandas.Series.explode:
series.explode()
0 1
0 2
0 3
1 4
1 5
1 6
dtype: object
To convert the list to columns, use apply + pandas.Series:
series.apply(pd.Series)
   0  1  2
0  1  2  3
1  4  5  6
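
As an aside (my note, not from the original answer): for longer Series, passing the underlying list of lists straight to the DataFrame constructor is usually much faster than apply + pandas.Series:
# same column-wise result as apply(pd.Series), built in one constructor call
pd.DataFrame(series.tolist(), index=series.index)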

adding multiple lists into one DataFrame column - pandas

l = {'col1': [[1,2,3], [4,5,6]]}
df = pd.DataFrame(data = l)
col1
0 [1, 2, 3]
1 [4, 5, 6]
Desired output:
col1
0 1
1 2
2 3
3 4
4 5
5 6
Here is explode:
df.explode('col1')
col1
0 1
0 2
0 3
1 4
1 5
1 6
You can use np.ravel to flatten the list of lists:
import numpy as np, pandas as pd
l = {'col1': [[1,2,3], [4,5,6]]}
df = pd.DataFrame(np.ravel(*l.values()), columns=l.keys())
>>> df
col1
0 1
1 2
2 3
3 4
4 5
5 6
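
One caveat worth adding (my note, not from the original answer): np.ravel only flattens cleanly when the sublists have equal length; with ragged sublists NumPy either raises or produces an object array of lists, depending on the version. itertools.chain is a length-agnostic alternative:
import pandas as pd
from itertools import chain

l = {'col1': [[1, 2, 3], [4, 5]]}  # ragged sublists
# chain.from_iterable flattens regardless of sublist lengths
df = pd.DataFrame(list(chain.from_iterable(l['col1'])), columns=['col1'])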

Replace pandas values by using them as indexes into another array

Consider an array
a = np.array([5, 12, 56, 36])
and a pandas DataFrame
b = pandas.DataFrame(np.array([1, 3, 0, 3, 1, 0, 2]))
how does one replace the values in b by using them as indices into a? I.e., the intended result is:
c = pandas.DataFrame([12, 36, 5, 36, 12, 5, 56])
Can't quite figure this out.
One way is using apply:
c = b.apply(lambda x: a[x])
Or index the NumPy array directly and pass the values to the DataFrame constructor:
c = pd.DataFrame(a[b[0].values])
    0
0  12
1  36
2   5
3  36
4  12
5   5
6  56
Let us try something different: Series.get.
pd.Series(a).get(b[0])
Out[57]:
1 12
3 36
0 5
3 36
1 12
0 5
2 56
dtype: int32
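Note (my addition): get returns a Series indexed by the looked-up keys, as shown above; if you want the clean 0..n-1 index of the other answers, reset it:
pd.Series(a).get(b[0]).reset_index(drop=True)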
map can be used:
b[0].map({i: j for i, j in enumerate(a)})
0    12
1    36
2     5
3    36
4    12
5     5
6    56
Name: 0, dtype: int64
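
A terser spelling of the same idea (my note, not from the original answer): dict(enumerate(a)) builds the position-to-value mapping directly:
# equivalent mapping, wrapped to match the question's desired DataFrame c
c = pd.DataFrame(b[0].map(dict(enumerate(a))))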

Why does groupby in Pandas place counts under existing column names?

I'm coming from R and do not understand the default groupby behavior in pandas. I create a DataFrame and group by the column 'id' like so:
d = {'id': [1, 2, 3, 4, 2, 2, 4], 'color': ["r","r","b","b","g","g","r"], 'size': [1,2,1,2,1,3,4]}
df = pd.DataFrame(data=d)
freq = df.groupby('id').count()
When I check the header of the resulting dataframe, all the original columns are there instead of just 'id' and 'freq' (or 'id' and 'count').
list(freq)
Out[117]: ['color', 'size']
When I display the resulting dataframe, the counts have replaced the values for the columns not employed in the count:
freq
Out[114]:
    color  size
id
1       1     1
2       3     3
3       1     1
4       2     2
I was planning to use groupby and then to filter by the frequency column. Do I need to delete the unused columns and add the frequency column manually? What is the usual approach?
count aggregates all columns of the DataFrame, excluding NaN values. If you need id as a column, use the as_index=False parameter or reset_index():
freq = df.groupby('id', as_index=False).count()
print (freq)
id color size
0 1 1 1
1 2 3 3
2 3 1 1
3 4 2 2
So if we add a NaN to a column, the counts differ:
d = {'id': [1, 2, 3, 4, 2, 2, 4],
     'color': ["r","r","b","b","g","g","r"],
     'size': [np.nan,2,1,2,1,3,4]}
df = pd.DataFrame(data=d)
freq = df.groupby('id', as_index=False).count()
print (freq)
id color size
0 1 1 0
1 2 3 3
2 3 1 1
3 4 2 2
You can specify columns for count:
freq = df.groupby('id', as_index=False)['color'].count()
print (freq)
id color
0 1 1
1 2 3
2 3 1
3 4 2
If you need the count including NaNs, use size:
freq = df.groupby('id').size().reset_index(name='count')
print (freq)
id count
0 1 1
1 2 3
2 3 1
3 4 2
Thanks to Bharath for pointing out another solution with value_counts (note that it returns the counts sorted in descending order):
freq = df['id'].value_counts().rename_axis('id').to_frame('freq').reset_index()
print (freq)
id freq
0 2 3
1 4 2
2 3 1
3 1 1
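
Since the question's stated goal was to filter by the frequency column, here is a minimal sketch of that last step (my addition, assuming the freq frame from the snippet above):
# keep only the ids occurring more than once, then filter the original rows
common = freq.loc[freq['freq'] > 1, 'id']
df[df['id'].isin(common)]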

Replacing values in a column not working - pandas

My sample data set:
import pandas as pd
import numpy as np
df = {'ID': ['A',0,0,1,'A',1],
      'ID1':['Yes','Yes','No','No','Yes','Yes']}
df = pd.DataFrame(df)
My real data set is read in from an Excel file; the column 'ID1' contains 'Yes' or 'No', and the column 'ID' contains 1, 0 and 'A'.
I want to:
For column 'ID1', replace 'Yes' with 1 and 'No' with 0.
For column 'ID', replace 'A' with 0.
I tried the following ways:
# The values didn't change
df['ID1']=df['ID1'].replace(['Yes', 'No'], [1, 0])
# Or, The values didn't change
df['ID1']=df['ID1'].replace(['Yes', 'No'], [1, 0],inplace='ignore')
# Or, it turns 'A' to 'nan'
df['ID'] = df['ID'].map({1: 1, 0: 0, 'A':0})
# OR, it turns 'A' to 'nan'
df['ID'] = df['ID'].map({1: 1, 0: 0, 'A':0}, na_action=None)
My code works perfectly if you run my sample data set code above (which builds the DataFrame directly), but it doesn't work with my real data set, which I read in from an Excel file. I searched online but couldn't figure out why. These columns from my real data set are object dtype; I tried converting them to string but it still doesn't work.
Edit:
My code for reading my real data set:
import os
path = os.chdir(r"S:\path")
df1 = pd.read_excel('data.xlsx',skiprows=[0])
df1['ID']=df1['ID'].str.strip()
df1['ID'] = df1['ID'].map({'1': 1, '0': 0, 'A':0}, na_action=None)
df1['ID1']=df1['ID1'].str.strip()
df1['ID1']=df1['ID1'].replace(['Yes', 'No'], [1, 0])
df1.head()
Out[55]:
ID1 ID
0 1 NaN
1 1 NaN
2 1 NaN
3 1 0.0
4 1 NaN
I have uploaded my file online, please check this link : https://filebin.ca/3UAh5051Psnv/test.xlsx
Try to clean up the ID and ID1 columns. After read_excel, the ID column holds real integers mixed with strings, and the .str accessor returns NaN for non-string elements, which is why .str.strip() produced NaN; casting to str first avoids this:
df['ID'] = df['ID'].astype(str).str.strip().map({'1': 1, '0': 0, 'A':0}, na_action=None)
df['ID1'] = df['ID1'].str.strip().replace(['Yes', 'No'], [1, 0])
Result:
In [234]: df
Out[234]:
ID1 ID
0 1 1
1 1 1
2 1 1
3 1 0
4 1 1
5 1 1
6 1 0
7 1 1
8 1 1
9 1 1
10 1 1
11 1 1
12 1 1
13 1 0
14 1 1
15 1 1
16 1 0
17 1 1
18 1 1
19 1 1
20 1 1
21 1 1
22 1 1
23 1 1
24 1 1
25 1 1
26 1 1
27 1 1
28 1 1
29 1 1
30 1 1
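
For reference, a minimal sketch of the failure mode (my addition, not part of the original answer): on an object column mixing ints and strings, the .str accessor maps the non-string elements to NaN, so a later map or replace sees NaN instead of 1 and 0:
import pandas as pd

s = pd.Series([1, 0, 'A '])      # mixed ints and strings, as read_excel can produce
s.str.strip()                    # ints become NaN: NaN, NaN, 'A'
s.astype(str).str.strip()        # everything survives as strings: '1', '0', 'A'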