Single cell string to list to multiple rows - pandas

I have a pandas data frame. Currently the list column is a string; I want to split it on spaces and replicate each row so that the primary key is associated with each item in the list. Can you please advise me on how I can achieve this?
Edit:
I need to copy down the value column after splitting and stacking the list column
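For illustration, a frame of this shape (the 'value' numbers here are hypothetical stand-ins for the value column mentioned in the edit):
import pandas as pd

df = pd.DataFrame({'List': ['a b c', 'd e f'],
                   'value': [10, 20]})
df.index.name = 'Primary Key'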

If your data frame is df, you can do:
df.List.str.split(' ').apply(pd.Series).stack()
and you will get
Primary Key
0  0    a
   1    b
   2    c
1  0    d
   1    e
   2    f
dtype: object
You are splitting the variable List on spaces, turning the resulting list into a series, and then stacking it to turn it into long format, indexed on the primary key, along with a sequence for each item obtained from the split.
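One way to satisfy the edit (copying the value column down) is to drop the extra index level and join the stacked Series back on. A sketch, using the hypothetical frame above:
s = df['List'].str.split(' ').apply(pd.Series).stack()
s.index = s.index.droplevel(-1)        # drop the per-item sequence level
s.name = 'List'                        # give the Series a name so join knows the column
out = df.drop(columns='List').join(s)  # 'value' is copied down for each item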

My version:
df['List'].str.split().explode()
produces
0    a
0    b
0    c
1    d
1    e
1    f
Name: List, dtype: object
Regarding the edit to the question, the following tweak will give you what you need, I think:
df['List'] = df['List'].str.split()
df.explode('List')
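Equivalently, without mutating df in place (assign builds the split column on the fly):
out = df.assign(List=df['List'].str.split()).explode('List')
# the index repeats per item and 'value' is copied down automatically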

Here is a solution that also rebuilds the per-item sequence as a second index level:
df = df.assign(List=df['List'].str.split()).explode('List')
df['cc'] = df.groupby(level=0).cumcount()
df = df.set_index('cc', append=True)
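On the example frame, the cumcount step numbers the items within each primary key; appending it to the index mirrors the two-level result of the stack() approach in the first answer:
print(df.groupby(level=0).cumcount().tolist())
# [0, 1, 2, 0, 1, 2]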

Related

Select rows where column value is a combination of numbers and letters

Having a dataset like this:
word
0 TBH46T
1 BBBB
2 5AAH
3 CAAH
4 AAB1
5 5556
Which would be the most efficient way to select the rows where column word is a combination of numbers and letters?
The output would be like this:
word
0 TBH46T
2 5AAH
4 AAB1
A possible solution would be to create a new column, using apply and regex, in which to store whether column word has the desired structure. But I'm curious whether this could be achieved in a more straightforward way.
Use Series.str.contains to build two masks, one matching a digit and one matching a non-digit, and combine them with & for bitwise AND:
df = df[df['word'].str.contains(r'\d') & df['word'].str.contains(r'\D')]
print (df)
word
0 TBH46T
2 5AAH
4 AAB1
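If you prefer a single pass, the same test can be written as one regex with two lookaheads. Note that \D matches any non-digit (including punctuation), so swap in [A-Za-z] if you specifically want letters:
df = df[df['word'].str.contains(r'^(?=.*\d)(?=.*\D)')]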

Using a list as an argument in groupby() in pandas when none of the key elements match column or index names

So I have a dataframe of random values, as below, and a book I am studying uses a list as the groupby key (key_list). How is the dataframe grouped in this case, since none of the list values match column or index names? The last two lines are confusing to me.
people = pd.DataFrame(np.random.randn(5,5), columns = ['a','b','c','d','e'], index=['Joe','Steve','Wes','Jim','Travis'])
key_list = ['one','one','one','two','two']
people.groupby(key_list).min()
people.groupby([len, key_list]).min()
Thank you in advance!
The user guide on groupby explains a lot and I suggest you have a look at it. I'll explain as much as I understand for your use case.
You can verify the groups created using the groups attribute:
people.groupby(key_list).groups
{'one': Index(['Joe', 'Steve', 'Wes'], dtype='object'),
'two': Index(['Jim', 'Travis'], dtype='object')}
You have a dictionary whose keys 'one' and 'two' are the groups built from the key_list list. As such, when you ask for the min, pandas looks at each group and picks out the minimum of each column. Let's inspect group 'one' using the get_group method:
people.groupby(key_list).get_group('one')
a b c d e
Joe -0.702122 0.277164 1.017261 -1.664974 -1.852730
Steve -0.866450 -0.373737 1.964857 -1.123291 1.251595
Wes -0.043835 -0.011108 0.214802 0.065022 -1.335713
You can see that Steve has the lowest value in column 'a'. When you run the next line, it gives you the column-wise minimum of that group:
people.groupby(key_list).get_group('one').min()
a -0.866450
b -0.373737
c 0.214802
d -1.664974
e -1.852730
dtype: float64
The same concept applies when you run it on the second group 'two'. As such, when you run the first part of your groupby code:
people.groupby(key_list).min()
You get the column-wise minimum for each group:
a b c d e
one -0.866450 -0.373737 0.214802 -1.664974 -1.852730
two -1.074355 -0.098190 -0.595726 -2.194481 0.232505
The second part of your code, which involves len, applies the same grouping concept. Here len is called on each index label, so the dataframe is grouped by the length of the names in its index: (Jim, Joe, Wes) have 3 letters, (Steve) has 5, and (Travis) has 6. That grouping is then combined with key_list to give the final output:
              a         b         c         d         e
3 one -0.702122 -0.011108  0.214802 -1.664974 -1.852730
  two -0.928987 -0.098190  3.025985  0.702471  0.232505
5 one -0.866450 -0.373737  1.964857 -1.123291  1.251595
6 two -1.074355  1.110879 -0.595726 -2.194481  0.394216
Note that for 3 the output shows both 'one' and 'two', because 'Joe' and 'Wes' are three-letter names in group 'one' while 'Jim' is the only three-letter name in group 'two'. The same concept applies to the 5- and 6-letter names.
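You can inspect that composite grouping the same way as before; the group labels are (length, key) tuples:
people.groupby([len, key_list]).groups
# roughly: {(3, 'one'): ['Joe', 'Wes'], (3, 'two'): ['Jim'],
#           (5, 'one'): ['Steve'], (6, 'two'): ['Travis']}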

Get column index label based on values

I have the following:
C1 C2 C3
0 0 0 1
1 0 0 1
2 0 0 1
And I would like to get the corresponding column index value that has 1's, so the result
should be "C3".
I know how to do this by transposing the dataframe and then getting the index values, but this is not ideal for the dataframes I have, and I wonder whether there might be a more efficient solution?
I would save the result in a list, because there could be more than one column with values equal to 1. You can use DataFrame.loc.
If all column values must be 1, you can use:
df.loc[:,df.eq(1).all()].columns.tolist()
Output:
['C3']
If this isn't required, use:
df.loc[:,df.eq(1).any()].columns.tolist()
Or, as suggested by @piRSquared, you can select directly from df.columns:
[*df.columns[df.eq(1).all()]]
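A quick end-to-end check of the variants on the example frame:
import pandas as pd

df = pd.DataFrame({'C1': [0, 0, 0], 'C2': [0, 0, 0], 'C3': [1, 1, 1]})
print(df.columns[df.eq(1).all()].tolist())  # ['C3'] - every row is 1
print(df.columns[df.eq(1).any()].tolist())  # ['C3'] - at least one 1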

Change value (string manipulation) in pandas DataFrame

I am reading a CSV file into a pandas DataFrame, but it needs to be cleaned up before it can be used. I need to do two things:
use regex to filter values
apply string functions such as trim, left, right, ...
For instance, DataFrame may looks like:
0 city_some_string_45
1 city_Other_string_56
2 city_another_string_77
so I need to filter (using regex) for all rows whose value starts with "city" and get the last two characters.
the end result should looks like:
0 45
1 56
2 77
In other words, the logic I want to apply is: read the value of the cell, and if it starts with city (filtering with regex, i.e. ^city), replace the value of the cell with its last two characters (e.g. using a right string function).
For a dataframe like this:
No city
0 0 city_some_string_45
1 1 city_Other_string_56
2 2 city_another_string_77
Filter the dataframe to keep the rows whose city column starts with 'city':
df = df[df.city.str.startswith('city')]
You can use str.extract to pull out only the number:
df['city'] = df.city.str.extract(r'(\d+)', expand=False).astype(int)
The resulting df
No city
0 0 45
1 1 56
2 2 77
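The two steps also chain cleanly; a compact variant (expand=False makes str.extract return a Series rather than a DataFrame):
df = (df[df['city'].str.startswith('city')]
        .assign(city=lambda d: d['city'].str.extract(r'(\d+)$', expand=False).astype(int)))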

Excel: one column has duplicates of each value, I need to take averages of the corresponding two values from the other columns

Example:
column A column B
A 1
A 2
B 2
B 2
C 1
C 1
I would somehow like to get the following result:
column A column B
A 1.5
B 2
C 1
(which are the averages of 1 and 2, 2 and 2, and 1 and 1)
How do I achieve that?
Thanks
If you're using Excel 2007 or above, you can also use the shorter AVERAGEIF function:
=AVERAGEIF($A$1:$A$6,D1,$B$1:$B$6)
Less typing, easier to read.
In D1:D3, type A, B, C. Then in E1, put this formula
=SUMIF($A$1:$A$6,D1,$B$1:$B$6)/COUNTIF($A$1:$A$6,D1)
and fill down to E3. If you want to replace the existing data, copy E1:E3 and paste-special-values over itself. Then delete A:C.
Alternatively, you can add headers to your data, say "Letter" and "Number". Then create a Pivot Table from your data. Put Letter in the rows section and Number in the Data section. Change your Data section from SUM to AVERAGE and you'll get the same result.
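For comparison, the same duplicate-averaging is a one-liner in pandas, assuming hypothetical column names 'A' and 'B':
import pandas as pd

df = pd.DataFrame({'A': ['A', 'A', 'B', 'B', 'C', 'C'],
                   'B': [1, 2, 2, 2, 1, 1]})
print(df.groupby('A', as_index=False)['B'].mean())
#    A    B
# 0  A  1.5
# 1  B  2.0
# 2  C  1.0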