How can I select rows of a pandas DataFrame by the length of their elements?
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randn(4,4).astype(str))
df.apply(lambda x: len(x[1]))
0 19
1 19
2 18
3 20
dtype: int64
Here we see three distinct lengths.
I am looking for an operation along the lines of df[len(df)==19]. Is that possible?
You could take advantage of the vectorized string operations available under .str, instead of using apply:
>>> df.applymap(len)
0 1 2 3
0 19 18 18 21
1 20 18 19 18
2 18 19 20 18
3 19 19 18 18
>>> df[1].str.len()
0 18
1 18
2 19
3 19
Name: 1, dtype: int64
>>> df.loc[df[1].str.len() == 19]
0 1 2 3
2 0.2630843312551179 -0.4811731811687397 -0.04493981407412525 -0.378866831599991
3 -0.5116348949042413 0.07649572869385729 0.8899251802216082 0.5802762385702874
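As a self-contained sketch of the same idea, the snippet below uses a small hand-written frame (the column names "a" and "b" and their values are made up for illustration) so that the lengths are predictable. It filters on one column with .str.len(), then builds a per-element length table to keep rows where any column matches.

```python
import pandas as pd

# Hypothetical data: string columns with known lengths.
df = pd.DataFrame({"a": ["0.25", "-1.5", "3.125"], "b": ["x", "yy", "zzz"]})

# Rows whose entry in column "a" has exactly 4 characters.
selected = df.loc[df["a"].str.len() == 4]

# Element-wise lengths for every column, then rows where any entry has length 3.
lengths = df.apply(lambda col: col.str.len())
any_len3 = lengths.eq(3).any(axis=1)

print(selected)
print(df.loc[any_len3])
```

Computing the lengths column by column with .str.len() avoids relying on applymap, which newer pandas releases deprecate.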
Here is a simple example to show you what is going on:
import pandas as pd
import numpy as np
df=pd.DataFrame(np.random.randn(4,4).astype(str))
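# apply works column by column by default, so x[0] is each column's entry in row 0
lengths = df.apply(lambda x: len(x[0]))
mask = lengths < 15
print(df)
print(lengths)
print(mask)
# the boolean mask is aligned on the index, so this keeps the rows labelled True
print(df[mask])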
Results in:
0 1 2 3
0 0.315649003654 -1.20005871043 -0.0973557747322 -0.0727740019505
1 -0.270800223158 -2.96509489589 0.822922470677 1.56021584947
2 -2.36245475786 0.0763821870378 1.0540009757 -0.452842084388
3 -1.03486927366 -0.269946751202 0.0611709385483 0.0114964425747
0 14
1 14
2 16
3 16
dtype: int64
0 True
1 True
2 False
3 False
dtype: bool
0 1 2 3
0 0.315649003654 -1.20005871043 -0.0973557747322 -0.0727740019505
1 -0.270800223158 -2.96509489589 0.822922470677 1.56021584947
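One detail of the example above is easy to miss: because apply runs over columns by default, the lengths it prints belong to the row-0 entry of each column, and the mask only lines up with the rows because the frame is square. A minimal sketch of the difference, using made-up values, and building a per-row mask from column 0 instead:

```python
import pandas as pd

# Hypothetical 3x2 frame of strings with known lengths.
df = pd.DataFrame([["0.31", "-1.2000"], ["-0.2708", "2.9"], ["1.5", "0.07"]])

# Column-wise (default): one value per column, the length of its row-0 entry.
per_column = df.apply(lambda col: len(col[0]))

# Row-wise: one value per row, the length of that row's entry in column 0.
per_row = df[0].str.len()

# Keep the rows whose column-0 entry is shorter than 5 characters.
short_rows = df[per_row < 5]
print(short_rows)
```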
Related
Is there a way to find an element in a pandas DataFrame by its row and column position? For example, if we have a list, L = [0,3,2,3,2,4,30,7], we can use L[2] and get the value 2 in return.
Use .iloc
df = pd.DataFrame({'L':[0,3,2,3,2,4,30,7], 'M':[10,23,22,73,72,14,130,17]})
L M
0 0 10
1 3 23
2 2 22
3 3 73
4 2 72
5 4 14
6 30 130
7 7 17
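df.iloc[2]['L']      # row 2, then column 'L': scalar
df.iloc[2:3, 0:1]    # slices return a 1x1 DataFrame, not a scalar
df.iat[2, 0]         # scalar at row position 2, column position 0
2
df.iloc[6]['M']
df.iloc[6:7, 1:2]
df.iat[6, 1]
130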
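To make the difference between these accessors concrete, here is a short sketch on the same frame. It adds the label-based counterparts .at and .loc, which are not in the answer above but behave the same way here because the row labels are the default 0 to 7.

```python
import pandas as pd

df = pd.DataFrame({'L': [0, 3, 2, 3, 2, 4, 30, 7],
                   'M': [10, 23, 22, 73, 72, 14, 130, 17]})

by_position = df.iat[2, 0]      # scalar, integer positions only
by_label = df.at[6, 'M']        # scalar, row label and column label
via_loc = df.loc[2, 'L']        # scalar, label-based like .at
sub_frame = df.iloc[2:3, 0:1]   # slicing keeps a 1x1 DataFrame

print(by_position, by_label, via_loc)
print(sub_frame)
```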
I have a Julia DataFrame that looks like this:
time data
0 34
1 34
2 30
3 37
4 32
5 35
How do I create a new binary column that is 0 if time is less than 2 or greater than 4, and 1 otherwise?
Like this:
time data x
0 34 0
1 34 0
2 30 1
3 37 1
4 32 1
5 35 0
In python, I would do something like:
def func(df):
    if df.time < 2 or df.time > 4:
        return 0
    else:
        return 1

df['x'] = df.apply(func, axis=1)
In Julia we have the beautiful Dot Syntax which can be gracefully applied here:
julia> df[!, :x] = 2 .<= df[!, :time] .<= 4
6-element BitVector:
0
0
1
1
1
0
or alternatively
df.x = 2 .<= df.time .<= 4
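For comparison, the row-wise apply from the question's Python snippet can be written in a similarly vectorized way in pandas. This is only a sketch of the analogous idiom, rebuilding the question's table by hand:

```python
import pandas as pd

df = pd.DataFrame({"time": [0, 1, 2, 3, 4, 5],
                   "data": [34, 34, 30, 37, 32, 35]})

# between is inclusive on both ends by default, matching 2 <= time <= 4.
df["x"] = df["time"].between(2, 4).astype(int)
print(df)
```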
What is this error about, and why does it occur? It says "Length of passed
values is 1, index implies 10". I have run the code many times and get the
same error each time:
ser = pd.Series(np.random.randint(1, 50, 10))
result = np.argwhere(ser % 3==0)
print(result)
argwhere() operates on a NumPy array, not a pandas Series. See below:
a = np.random.randint(1, 50, 12)
a = pd.Series(a)
print(a)
np.argwhere(a.values%3==0)
output
0 28
1 46
2 4
3 40
4 19
5 26
6 6
7 24
8 26
9 30
10 33
11 27
dtype: int64
array([[ 6],
[ 7],
[ 9],
[10],
[11]])
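A repeatable version of the same approach, using the values printed above instead of random ones. np.flatnonzero is swapped in here as an assumption of what is wanted: it returns the positions as a flat array rather than the column of one-element rows that argwhere produces.

```python
import numpy as np
import pandas as pd

# The values from the output above, fixed so the result is reproducible.
ser = pd.Series([28, 46, 4, 40, 19, 26, 6, 24, 26, 30, 33, 27])

# Work on the underlying NumPy array rather than on the Series itself.
values = ser.to_numpy()
positions = np.flatnonzero(values % 3 == 0)
print(positions)
```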
Please read the documentation for numpy.random.randint. You will see that the parameters are (low, high, size).
In your case, you are passing (1, 50, 10), so 10 random integers will be generated from 1 (inclusive) up to 50 (exclusive).
If you want the multiples of 3, use ser[ser % 3==0] rather than np.argwhere.
See similar issue raised earlier and answered on Stack Overflow
import pandas as pd
import numpy as np
ser = pd.Series(np.random.randint(1, 50, 10))
print (ser)
result = ser[ser % 3==0]
print(result)
Output of this will be:
Original Series.
0 17
1 34
2 29
3 15
4 24
5 20
6 21
7 48
8 6
9 42
dtype: int64
Multiples of 3 will be:
3 15
4 24
6 21
7 48
8 6
9 42
dtype: int64
Use Index.tolist:
In [1374]: ser
Out[1374]:
0 44
1 5
2 35
3 10
4 16
5 20
6 25
7 9
8 44
9 16
dtype: int64
In [1372]: l = ser[ser % 3 == 0].index.tolist()
In [1373]: l
Out[1373]: [7]
where l will be a list of the index labels of the elements that are multiples of 3.
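With the default integer index, labels and positions coincide, but they differ as soon as the Series has another index. A small sketch with made-up values and string labels shows both:

```python
import numpy as np
import pandas as pd

# Hypothetical Series with non-default labels.
ser = pd.Series([9, 5, 12, 7], index=["a", "b", "c", "d"])
mask = ser % 3 == 0

labels = ser[mask].index.tolist()                   # index labels
positions = np.flatnonzero(mask.to_numpy()).tolist()  # integer positions

print(labels, positions)
```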
I have code that reads files from CSV and computes a price difference, but to make it simpler I made a reproducible example, shown below. I want to append each result to the end of a specific column. For example, the first pass of the loop handles size 1 and minute 1, so it should append to the column named 1;1 for file2, file3 and file4. So the output should be:
1;1 1;2 1;3 2;1 2;2 2;3
0 0 0 same below as for 1
0 0 0
2 2 2
2 2 2
4 4 4
4 4 4
5 5 5
0 0 0
0 0 0
0 0 0
2 2 2
2 2 2
4 4 4
4 4 4
6 6 6
6 6 6
0 0 0
0 0 0
0 0 0
2 2 2
2 2 2
4 4 4
4 4 4
6 6 6
7 7 7
I am using a loop to set prefixed data frame columns, because in my original code the number of minutes, sizes, and files is inputted by the user.
import numpy as np
import pandas as pd
file =[1,2,3,4,5,6,6,2]
file2=[1,2,3,4,5,6,7,8]
file3=[1,2,3,4,5,6,7,9]
file4=[1,2,1,2,1,2,1,2]
size=[1,2]
minutes=[1,2,3]
list1=[file,file2,file3]
data=pd.DataFrame(file)
data2=pd.DataFrame(file2)
data3=pd.DataFrame(file3)
list1=(data,data2,data3)
datas=pd.DataFrame(file4)
col_names = [str(sizer)+';'+str(number) for sizer in size for number in minutes]
datanew=pd.DataFrame(columns=col_names)
for sizes in size:
    for minute in minutes:
        for files in list1:
            pricediff=files-data
            datanew[str(sizes)+';'+str(minute)]=datanew[str(sizes)+';'+str(minute)].append(pricediff,ignore_index=True)
print(datanew)
Edit: When trying this line: datanew=datanew.append({str(sizes)+';'+str(minute): df['pricediff']},ignore_index=True) It appends the data but the result isn't "clean"
The result from my original data, gives me this:
111;5.0,1111;5.0
"0 4.5
1 0.5
2 8
3 8
4 8
...
704 3.5
705 0.5
706 11.5
707 0.5
708 9.0
Name: pricediff, Length: 709, dtype: object",
"price 0.0
0 0.0
Name: pricediff, dtype: float64",
"0 6.5
1 6.5
2 3.5
3 13.0
Name: pricediff, Length: 4, dtype: float64",
What you are looking for is:
datanew=datanew.append({str(sizes)+';'+str(minute): pricediff}, ignore_index=True)
This happens because you cannot change the length of a single column of a DataFrame without changing the length of the whole DataFrame.
Now consider the below as an example:
import pandas as pd
df=pd.DataFrame({"a": list("xyzpqr"), "b": [1,3,5,4,2,7], "c": list("pqrtuv")})
print(df)
#this will fail:
#df["c"]=df["c"].append("abc", ignore_index=True)
#print(df)
#what you can do instead:
df=df.append({"c": "abc"}, ignore_index=True)
print(df)
#you can even create new column that way:
df=df.append({"x": "abc"}, ignore_index=True)
Edit
In order to append pd.Series do literally the same:
abc=pd.Series([-1,-2,-3], name="c")
df=df.append({"c": abc}, ignore_index=True)
print(df)
abc=pd.Series([-1,-2,-3], name="x")
df=df.append({"x": abc}, ignore_index=True)
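Note that DataFrame.append and Series.append were removed in pandas 2.0, so the snippets above only run on older versions. A minimal sketch of one way to get stacked columns with pd.concat, reusing the question's example lists; the names base, frames and columns are introduced here for illustration, and the layout (one stacked column per size;minute pair) is an assumption about the desired result.

```python
import pandas as pd

file = [1, 2, 3, 4, 5, 6, 6, 2]
file2 = [1, 2, 3, 4, 5, 6, 7, 8]
file3 = [1, 2, 3, 4, 5, 6, 7, 9]
sizes = [1, 2]
minutes = [1, 2, 3]

base = pd.Series(file)  # reference prices
frames = [pd.Series(f) for f in (file, file2, file3)]

columns = {}
for size in sizes:
    for minute in minutes:
        # One difference Series per file, stacked end to end.
        diffs = [f - base for f in frames]
        columns[f"{size};{minute}"] = pd.concat(diffs, ignore_index=True)

# Build the frame once, after every column has its full length.
datanew = pd.DataFrame(columns)
print(datanew)
```

Collecting complete columns first and constructing the DataFrame at the end sidesteps the length mismatch described above.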
import pandas as pd
test=[
[14,12,1,13,15],
[11,21,1,19,32],
[48,16,1,16,12],
[22,24,1,18,41],
]
df = pd.DataFrame(test)
x = [1,2,3,4]
df['new'] = pd.DataFrame(x)
In this example, the new column 'new' is added to df itself.
What I want instead is a new DataFrame (df1) that includes the column 'new' (six columns), while df stays unchanged (only five columns).
How do I do that?
You can create the new DataFrame with .assign:
import pandas as pd
df= pd.DataFrame(test)
df1 = df.assign(new=x)
print(df)
0 1 2 3 4
0 14 12 1 13 15
1 11 21 1 19 32
2 48 16 1 16 12
3 22 24 1 18 41
print(df1)
0 1 2 3 4 new
0 14 12 1 13 15 1
1 11 21 1 19 32 2
2 48 16 1 16 12 3
3 22 24 1 18 41 4
.assign returns a new object, so you can modify it without affecting the original. The other alternative would be
df1 = df.copy() #New object, modifications do not affect `df`.
df1['new'] = x
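Alternatively, insert adds a column in place, so apply it to a copy to leave df unchanged. Here 'e' is the new column, filled with random values from np.random.randint:
df1 = df.copy()
df1.insert(len(df1.columns), 'e', np.random.randint(0, 5, len(df1)))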
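A quick way to confirm that the original frame is left untouched is to run the approaches side by side on the question's data and compare shapes. This is only a sketch; df2 is a name introduced here for the copy-based variant.

```python
import pandas as pd

test = [
    [14, 12, 1, 13, 15],
    [11, 21, 1, 19, 32],
    [48, 16, 1, 16, 12],
    [22, 24, 1, 18, 41],
]
x = [1, 2, 3, 4]

df = pd.DataFrame(test)

df1 = df.assign(new=x)  # returns a new DataFrame, df is not modified

df2 = df.copy()         # independent copy, safe to modify
df2["new"] = x

print(df.shape, df1.shape, df2.shape)
```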