I have a pandas dataframe.
One of its columns holds, in every row, a list of 60 elements (the length is constant across rows).
How do I convert each of these lists into a row of a new dataframe?
Just to be clearer: say A is the original dataframe with n rows. One of its columns contains a list of 60 elements.
I need to create a new dataframe of shape n x 60.
My attempt:
def expand(x):
    return pd.DataFrame(np.array(x).reshape(-1, len(x)))

df["col"].apply(lambda x: expand(x))
It gives funny results...
The weird thing is that if I call the function expand on a single row, it does exactly what I expect from it:
expand(df["col"][0])
To ChootsMagoots: this is the result when I try to apply your suggestion. It does not work.
Sample data
df = pd.DataFrame()
df['col'] = np.arange(4*5).reshape(4,5).tolist()
df
Output:
col
0 [0, 1, 2, 3, 4]
1 [5, 6, 7, 8, 9]
2 [10, 11, 12, 13, 14]
3 [15, 16, 17, 18, 19]
Now extract a DataFrame from col:
df.col.apply(pd.Series)
Output:
0 1 2 3 4
0 0 1 2 3 4
1 5 6 7 8 9
2 10 11 12 13 14
3 15 16 17 18 19
Try this:
new_df = pd.DataFrame(df["col"].tolist())
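For example, with the sample df built above (each cell holding an equal-length list), a quick check shows this produces the full frame in one step:
new_df = pd.DataFrame(df["col"].tolist())
new_df.shape  # (4, 5) for the sample above; (n, 60) in the question's case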
This is a little frankensteinish, but you could also try:
import numpy as np
np.savetxt('outfile.csv', np.array(df['col'].tolist()), delimiter=',')
new_df = pd.read_csv('outfile.csv', header=None)  # header=None: savetxt does not write a header row
You can try this as well:
newCol = pd.Series(yourList)
df['colD'] = newCol.values
The above code:
1. Creates a pandas Series.
2. Maps the Series values onto a new column in the original dataframe.
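A minimal, self-contained sketch of those two steps (the frame and yourList here are stand-ins, not from the question):
import pandas as pd

df = pd.DataFrame({'colA': [1, 2, 3]})
yourList = ['x', 'y', 'z']      # length must match the number of rows
newCol = pd.Series(yourList)    # step 1: create a pandas Series
df['colD'] = newCol.values      # step 2: attach its values as a new column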
Related
I wish to extract the list of values from one column in pandas, and then use those values to create additional columns based on the number of values within the list.
My dataframe:
a = pd.DataFrame({"test":["","","","",[1,2,3,4,5,6,6],"","",[11,12,13,14,15,16,17]]})
Current output:
test
0
1
2
3
4 [1, 2, 3, 4, 5, 6, 6]
5
6
7 [11, 12, 13, 14, 15, 16, 17]
expected output:
example_1 example_2 example_3 example_4 example_5 example_6 example_7
0
1
2
3
4 1 2 3 4 5 6 6
5
6
7 11 12 13 14 15 16 17
Let's say we expect each list to have 7 values; this covers most of my current cases, so if I can set that limit it will be a good solution. Thank you.
This should be what you're looking for. I replaced the nan values with blank cells, but you can change that of course.
a = pd.DataFrame({"test":["","","","",[1,2,3,4,5,6,6],"","",[11,12,13,14,15,16,17]]})
ab = a.test.apply(pd.Series).fillna("")
ab.columns = ['example_' + str(i) for i in range(1, 8)]
Edit: using .add_prefix() as the other answer uses is prettier than setting the column names manually with a list comprehension.
Here's a one-liner:
(a.test.apply(pd.Series)
   .fillna("")
   .set_axis(range(1, a.test.str.len().max() + 1), axis=1)
   .add_prefix("example_")
)
The set_axis is just to make the columns 1-indexed; if you don't mind 0-indexing, you can leave it out.
How to add a list to a dataframe column such that the values repeat for every row of the dataframe?
mylist = ['one error','delay error']
df['error'] = mylist
This gives a length-mismatch error, since df has 2000 rows. I can still add it if I make mylist into a Series; however, that only fills the first row, and the output looks like this:
d = {'col1': [1, 2, 3, 4, 5],
'col2': [3, 4, 9, 11, 17],
'error':['one error',np.NaN,np.NaN,np.NaN,np.NaN]}
df = pd.DataFrame(data=d)
However I would want the solution to look like this:
d = {'col1': [1, 2, 3, 4, 5],
'col2': [3, 4, 9, 11, 17],
'error': [['one error', 'delay error'], ['one error', 'delay error'],
          ['one error', 'delay error'], ['one error', 'delay error'],
          ['one error', 'delay error']]}
df = pd.DataFrame(data=d)
I have tried ffill() but it didn't work.
You can assign to the result of df.to_numpy(). Note that you'll have to use [mylist] instead of mylist, even though it's already a list ;)
>>> mylist = ['one error']
>>> df['error'].to_numpy()[:] = [mylist]
>>> df
col1 col2 error
0 1 3 [one error]
1 2 4 [one error]
2 3 9 [one error]
3 4 11 [one error]
4 5 17 [one error]
>>> mylist = ['abc', 'def', 'ghi']
>>> df['error'].to_numpy()[:] = [mylist]
>>> df
col1 col2 error
0 1 3 [abc, def, ghi]
1 2 4 [abc, def, ghi]
2 3 9 [abc, def, ghi]
3 4 11 [abc, def, ghi]
4 5 17 [abc, def, ghi]
It's not a very clean way to do it, but you can first build a list with one entry per row of the dataframe, and only then put it into your dataframe.
mylist = ['one error','delay error']
new_mylist = [mylist for i in range(len(df['col1']))]
df['error'] = new_mylist
Repeat the elements in mylist N times, where N is the ceiling of the dataframe length divided by the list length; assign this to the new column, slicing the repeated list so it doesn't exceed the number of rows:
df['error'] = (mylist * (len(df) // len(mylist) + 1))[:len(df)]
col1 col2 error
0 1 3 one error
1 2 4 delay error
2 3 9 one error
3 4 11 delay error
4 5 17 one error
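The same tiling can also be spelled out with itertools; a sketch equivalent to the multiply-and-slice line above:
from itertools import cycle, islice

mylist = ['one error', 'delay error']
# cycle repeats mylist forever; islice cuts it off at exactly len(df) elements
df['error'] = list(islice(cycle(mylist), len(df)))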
Or, if storing the list's string representation in every row is acceptable:
df.assign(error=mylist.__str__())
This is a multipart problem. I have found solutions for each separate part, but when I try to combine these solutions, I don't get the outcome I want.
Let's say this is my dataframe:
df = pd.DataFrame(list(zip([1, 3, 6, 7, 7, 8, 4], [6, 7, 7, 9, 5, 3, 1])), columns = ['Values', 'Vals'])
df
Values Vals
0 1 6
1 3 7
2 6 7
3 7 9
4 7 5
5 8 3
6 4 1
Let's say I want to find the pattern [6, 7, 7] in the 'Values' column.
I can use a modified version of the second solution given here:
Pandas: How to find a particular pattern in a dataframe column?
pattern = [6, 7, 7]
pat_i = [df[i-len(pattern):i]                               # get the matching rows
         for i in range(len(pattern), len(df) + 1)          # for each window of 3 consecutive elements
         if all(df['Values'][i-len(pattern):i] == pattern)] # if the pattern matched
pat_i
[ Values Vals
2 6 7
3 7 9
4 7 5]
The only way I've found to narrow this down to just index values is the following:
pat_i = [df.index[i-len(pattern):i]                         # get just the index
         for i in range(len(pattern), len(df) + 1)          # for each window of 3 consecutive elements
         if all(df['Values'][i-len(pattern):i] == pattern)] # if the pattern matched
pat_i
[RangeIndex(start=2, stop=5, step=1)]
Once I've found the pattern, what I want to do, within the original dataframe, is reorder the pattern to [7, 7, 6], moving the entire associated rows as I do this. In other words, going by the index, I want to get output that looks like this:
df.reindex([0, 1, 3, 4, 2, 5, 6])
Values Vals
0 1 6
1 3 7
3 7 9
4 7 5
2 6 7
5 8 3
6 4 1
Then, finally, I want to reset the index so that the values in all the columns stay in the new re-ordered place:
Values Vals
0 1 6
1 3 7
2 7 9
3 7 5
4 6 7
5 8 3
6 4 1
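For reference, those two steps chain directly once the new order is known (a sketch using the order shown above):
df = df.reindex([0, 1, 3, 4, 2, 5, 6]).reset_index(drop=True)  # reorder rows, then renumber 0..n-1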
In order to use pat_i as a basis for re-ordering, I've tried to modify the second solution given here:
Python Pandas: How to move one row to the first row of a Dataframe?
target_row = 2
# Move target row to first element of list.
idx = [target_row] + [i for i in range(len(df)) if i != target_row]
However, I can't figure out how to exploit the pat_i RangeIndex object to use it with this code. The solution, when I find it, will be applied to hundreds of dataframes, each one of which will contain the [6, 7, 7] pattern that needs to be re-ordered in one place, but not the same place in each dataframe.
Any help appreciated...and I'm sure there must be an elegant, pythonic way of doing this, as it seems like it should be a common enough challenge. Thank you.
I just sort of rewrote your code. I held the first and last indexes to the side, reordered the indexes of interest, and put everything together in a new index. Then I just use the new index to reorder the data.
import pandas as pd

df = pd.DataFrame(list(zip([1, 3, 6, 7, 7, 8, 4], [6, 7, 7, 9, 5, 3, 1])), columns=['Values', 'Vals'])

pattern = [6, 7, 7]
new_order = [1, 2, 0]  # new order of the pattern

for i in list(df[df['Values'] == pattern[0]].index):
    # list(...) == pattern is simply False for windows shorter than the pattern,
    # so a partial match at the end of the frame can't raise a length error
    if list(df['Values'][i:i+len(pattern)]) == pattern:
        pat_i = df[i:i+len(pattern)]
        front_ind = list(range(0, pat_i.index[0]))
        back_ind = list(range(pat_i.index[-1] + 1, len(df)))
        pat_ind = [pat_i.index[j] for j in new_order]  # j avoids shadowing the loop variable
        new_ind = front_ind + pat_ind + back_ind
        df = df.loc[new_ind].reset_index(drop=True)
df
Output:
Values Vals
0 1 6
1 3 7
2 7 9
3 7 5
4 6 7
5 8 3
6 4 1
I want to get the values that have a specific result on .value_counts(), e.g.:
import numpy as np
import pandas as pd
data = np.array([3, 3, 5, 6, 7, 7, 7, 9, 9])
ser = pd.Series(data)
counts_nums = ser.value_counts()
print(counts_nums)
Here are the results:
7 3
9 2
3 2
6 1
5 1
dtype: int64
Now, I want to find a way to get the values that have a count equal to 2, which are 9 and 3. In other words, I want to index .value_counts().
What are the different ways of doing this?
Try this to get a list:
counts_nums[counts_nums == 2].index.tolist()
Output:
[9, 3]
counts_nums.loc[counts_nums==2]
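Both answers filter the same Series; a quick sketch of what each form returns:
matching = counts_nums[counts_nums == 2]  # Series mapping value -> count: {9: 2, 3: 2}
print(matching.index.tolist())            # [9, 3], just the values themselves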
It was working great until it wasn't, and I have no idea what I'm doing wrong. I've reduced it to a very simple dataset t:
1 2 3 4 5 6 7 8
0 3 16 3 2 17 2 3 2
1 3 16 3 2 19 4 3 2
2 3 16 3 2 9 2 3 2
3 3 16 3 2 19 1 3 2
4 3 16 3 2 17 2 3 1
5 3 16 3 2 17 1 17 1
6 3 16 3 2 19 1 17 2
7 3 16 3 2 19 4 3 1
8 3 16 3 2 19 1 3 2
9 3 16 3 2 7 2 17 1
corr = t.corr()
corr
returns an empty DataFrame
and
sns.heatmap(corr)
throws the following error "zero-size array to reduction operation minimum which has no identity"
I have no idea what's wrong. I've tried it with more rows etc., and double-checked that I don't have any missing values... what's going on? I had such a pretty heatmap earlier, and I've been trying to get it back.
As mentioned above, change the dtype to float. Simply:
corr = t.astype('float64').corr()
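If you want to confirm that dtype is the culprit before casting, a small diagnostic sketch (pd.to_numeric with errors='coerce' turns anything non-numeric into NaN):
print(t.dtypes)  # object columns are what make corr() come back empty
t_num = t.apply(pd.to_numeric, errors='coerce')
corr = t_num.corr()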
The problem here is not the dataframe itself but its origin. I found the same problem when using drop or iloc on a dataframe. The key is the dataframe's global dtype.
Let's say we have the following dataframe:
list_ex = [[1.1,2.1,3.1,4,5,6,7,8],[1.2,2.2,3.3,4.1,5.5,6,7,8],
[1.3,2.3,3,4,5,6.2,7,8],[1.4,2.4,3,4,5,6.2,7.3,8.1]]
list_ex_new=pd.DataFrame(list_ex)
you can calculate list_ex_new.corr() with no problem. If you check the attributes of the dataframe with vars(list_ex_new), you'll obtain:
{'_is_copy': None, '_data': BlockManager
Items: RangeIndex(start=0, stop=8, step=1)
Axis 1: RangeIndex(start=0, stop=4, step=1)
FloatBlock: slice(0, 8, 1), 8 x 4, dtype: float64, '_item_cache': {}}
where dtype is float64.
A new dataframe can be defined by list_new_new = list_ex_new.iloc[1:,:] and the correlations can be evaluated, successfully. A check of the dataframe's attributes shows:
{'_is_copy': <weakref ...>,
'_data': BlockManager
Items: RangeIndex(start=0, stop=8, step=1)
Axis 1: RangeIndex(start=1, stop=4, step=1)
FloatBlock: slice(0, 8, 1), 8 x 3, dtype: float64,
'_item_cache': {}}
where dtype is still float64.
A third dataframe can be defined:
list_ex_w = [['a','a','a','a','a','a','a','a'],[1.1,2.1,3.1,4,5,6,7,8],
[1.2,2.2,3.3,4.1,5.5,6,7,8],[1.3,2.3,3,4,5,6.2,7,8],
[1.4,2.4,3,4,5,6.2,7.3,8.1]]
list_ex_new_w=pd.DataFrame(list_ex_w)
An evaluation of the dataframe's correlation will result in an empty dataframe, since list_ex_new_w's attributes look like:
{'_is_copy': None, '_data': BlockManager
Items: RangeIndex(start=0, stop=8, step=1)
Axis 1: Index(['a', 1, 2, 3, 4], dtype='object')
ObjectBlock: slice(0, 8, 1), 8 x 5, dtype: object, '_item_cache': {}}
where dtype is now 'object', since the dataframe is not consistent in its types: there are strings and floats together. Finally, a fourth dataframe can be generated:
list_new_new_w = list_ex_new_w.iloc[1:,:]
this will generate the same dataframe but with no 'a's, apparently a perfectly correct dataframe for calculating the correlations. However, this will again return an empty dataframe. A final check of the dataframe's attributes shows:
vars(list_new_new_w)
{'_is_copy': None, '_data': BlockManager
Items: Index([1, 2, 3, 4], dtype='object')
Axis 1: RangeIndex(start=0, stop=8, step=1)
ObjectBlock: slice(0, 4, 1), 4 x 8, dtype: object, '_item_cache': {}}
where dtype is still object, and thus the corr method returns an empty dataframe.
This problem can be solved by using astype(float):
list_new_new_w.astype(float).corr()
In summary, it seems that when corr or cov (among other methods) are called, pandas generates a new dataframe with the same attributes, ignoring whether the new dataframe now has a consistent global type. I've been checking the pandas source code, and I understand this is the correct interpretation of pandas' implementation.
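A compact reproduction of the whole effect, as a sketch built from the frames defined above:
list_ex_new_w = pd.DataFrame(list_ex_w)       # object dtype: strings and floats mixed
list_new_new_w = list_ex_new_w.iloc[1:, :]    # the 'a' row is gone, but dtype stays object
print(list_new_new_w.corr().empty)            # True: corr() finds no numeric columns
print(list_new_new_w.astype(float).corr().empty)  # False: casting restores them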