Select nearest values around a certain value in column - pandas

I have the following df and I need to find all values that are equal or nearest to 660 in the column value.
In detail this means I must somehow iterate through the column value to find each of these 660 (or nearest) values. The values in column value run from 1 up to some end (the end varies), and when the end is reached the values start again from 1. Finally, I must select the entire row (all other columns) wherever value == 660 (or nearest). I also have a 'helper' column helper, which keeps the same value throughout one range from 1 to the end; it might be useful for getting the result (column helper is always 0 or 1). Here is the df example:
helper value
0 1
.
.
.
0 647
0 649
0 652
0 654
0 656
0 659
0 661
0 663
0 665
0 667
0 669
0 672
0 674
0 676
0 678
0 681
.
.
.
0 1000
1 1
.
.
.
1 647
1 649
1 652
1 654
1 656
1 659
1 661
1 663
1 665
1 667
1 669
1 672
1 674
1 676
1 678
1 681
.
.
1 1500
0 1
.
.
.
0 645
0 647
0 650
0 652
0 654
0 656
0 658
0 661
0 663
0 665
0 667
0 669
0 672
0 674
0 676
0 679
.
.
.
0 980
Thanks for any help or hints!

There is not enough info for a correct answer. What is the size of the dataframe? Also, what do you mean by nearest? The "where value == 660" part can be done by applying a mask to the pandas Series.
Something like df = df[df['value'] == 660] should do.
If you have a range of numbers, you can use boolean operators for the mask; note the parentheses, which are required because & binds more tightly than the comparisons:
mask = (df.value > 650) & (df.value < 670)
df = df[mask]
Let me know if this is what you are looking for.
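To actually select the single nearest row per cycle, here is a hedged sketch (assuming a new cycle can be detected wherever value drops back down; the toy data is made up for illustration):
import pandas as pd

# toy data: two cycles whose 'value' restarts at 1
df = pd.DataFrame({'helper': [0, 0, 0, 0, 1, 1, 1, 1],
                   'value':  [1, 656, 661, 1000, 1, 658, 663, 1500]})

# a new cycle starts wherever 'value' drops below its predecessor
cycle = (df['value'].diff() < 0).cumsum()

# within each cycle, keep the row whose 'value' is closest to 660
nearest = df.loc[(df['value'] - 660).abs().groupby(cycle).idxmin()]
print(nearest)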

Related

How to read a file in python?

I have a file in the .lammpstrj format which has 15 million rows of data points of the following pattern:
id mol type x y z
ITEM: TIMESTEP
100000
ITEM: NUMBER OF ATOMS
3000
ITEM: BOX BOUNDS pp pp pp
4.8659189006091452e-02 3.0951340810994285e+01
4.8659189006091452e-02 3.0951340810994285e+01
4.8659189006091452e-02 3.0951340810994285e+01
ITEM: ATOMS id mol type x y z
2325 775 1 4.5602 4.35401 4.16348
718 240 2 4.91829 3.19545 7.30041
2065 689 2 6.51189 1.25778 5.11324
639 213 1 6.84357 5.10011 0.530398
720 240 1 5.46433 3.36715 6.48044
694 232 2 0.107046 3.3119 7.42581
1855 619 2 6.17236 4.57208 5.02607
1856 619 1 6.65988 5.13298 5.69518
I want to store all of this data in a pandas DataFrame. How can I do this?
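A hedged sketch of one common approach (assuming every frame repeats the same 9-line header and the same atom count; the filename dump.lammpstrj is hypothetical):
import pandas as pd

n_atoms = 3000   # from the "ITEM: NUMBER OF ATOMS" line
header = 9       # header lines per frame in a LAMMPS dump

with open('dump.lammpstrj') as f:   # hypothetical filename
    lines = f.readlines()

rows = []
frame_len = header + n_atoms
for start in range(0, len(lines), frame_len):
    # skip the header lines, keep the atom lines of this frame
    rows.extend(line.split() for line in lines[start + header:start + frame_len])

df = pd.DataFrame(rows, columns=['id', 'mol', 'type', 'x', 'y', 'z'])
df = df.astype({'id': int, 'mol': int, 'type': int,
                'x': float, 'y': float, 'z': float})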

How can I merge two files while printing a given value on resulting empty fields using AWK?

I have two files:
01File:
1 2051
2 1244
7 917
X 850
22 444
21 233
Y 47
KI270728_1 6
KI270727_1 4
KI270734_1 3
KI270726_1 2
KI270713_1 2
GL000195_1 2
GL000194_1 2
KI270731_1 1
KI270721_1 1
KI270711_1 1
GL000219_1 1
GL000218_1 1
GL000213_1 1
GL000205_2 1
GL000009_2 1
and 02File:
1 248956422
2 242193529
7 159345973
X 156040895
Y 56887902
22 50818468
21 46709983
KI270728_1 1872759
KI270727_1 448248
KI270726_1 43739
GL000009_2 201709
KI270322_1 21476
GL000226_1 15008
KI270311_1 12399
KI270366_1 8320
KI270511_1 8127
KI270448_1 7992
I need to merge these two files based on Field 01 and print "0"s in the resulting empty fields.
I was trying to accomplish this using the following command:
awk 'FNR==NR{a[$1]=$2 FS $3;next}{ print $0 "\t" a[$1]}' 01File 02File
Which results in the following output:
1 248956422 2051
2 242193529 1244
7 159345973 917
X 156040895 850
Y 56887902 47
22 50818468 444
21 46709983 233
KI270728_1 1872759 6
KI270727_1 448248 4
KI270726_1 43739 2
GL000009_2 201709 1
KI270322_1 21476
GL000226_1 15008
KI270311_1 12399
KI270366_1 8320
KI270511_1 8127
KI270448_1 7992
However, I am having trouble adapting the command so that it prints a value of zero ("0") in the resulting empty fields, to generate the following output:
1 248956422 2051
2 242193529 1244
7 159345973 917
X 156040895 850
Y 56887902 47
22 50818468 444
21 46709983 233
KI270728_1 1872759 6
KI270727_1 448248 4
KI270726_1 43739 2
GL000009_2 201709 1
KI270322_1 21476 0
GL000226_1 15008 0
KI270311_1 12399 0
KI270366_1 8320 0
KI270511_1 8127 0
KI270448_1 7992 0
I would be grateful if you could point me in the right direction.
Use a conditional expression in place of a[$1]; "0" will then be printed instead of the empty string if no line matched:
awk 'FNR==NR{a[$1]=$2;next} {print $0 "\t" ($1 in a? a[$1]: "0")}' 01File 02File
I also simplified the first action, as there are only 2 fields.
Output:
1 248956422 2051
2 242193529 1244
7 159345973 917
X 156040895 850
Y 56887902 47
22 50818468 444
21 46709983 233
KI270728_1 1872759 6
KI270727_1 448248 4
KI270726_1 43739 2
GL000009_2 201709 1
KI270322_1 21476 0
GL000226_1 15008 0
KI270311_1 12399 0
KI270366_1 8320 0
KI270511_1 8127 0
KI270448_1 7992 0

Multilevel Column Pivot

I am learning pandas in Python, and I do not know how to pivot a DataFrame so that the columns have a multilevel index. I have the following pivot table:
import numpy as np

df = df.pivot_table(index=['FECHA', 'Planta'],
                    aggfunc={'Menor_F0': np.sum,
                             'Menor_fc': np.sum,
                             'Total_Muestras': 'count'})
It gives the following (P.S.: this part is correct):
Menor_F0 Menor_fc Total_Muestras
FECHA Planta
01/2014 455 0 0 2
470 1 2 5
01/2016 455 0 0 1
470 0 1 2
But I want to display it in this form. How can I do it?
FECHA 01/2014 01/2016
Menor_F0 Menor_fc Total_Muestras Menor_F0 Menor_fc Total_Muestras
PLANTA
455 0 0 2 0 0 1
470 1 2 5 0 1 2
You can try stack and unstack:
df.stack().unstack(level=(0,2))
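For example, a self-contained sketch that reproduces the small table above (the numbers are copied from the question):
import pandas as pd

idx = pd.MultiIndex.from_tuples(
    [('01/2014', 455), ('01/2014', 470),
     ('01/2016', 455), ('01/2016', 470)],
    names=['FECHA', 'Planta'])
df = pd.DataFrame({'Menor_F0': [0, 1, 0, 0],
                   'Menor_fc': [0, 2, 0, 1],
                   'Total_Muestras': [2, 5, 1, 2]}, index=idx)

# stack the measure columns into the index, then move FECHA (level 0)
# and the measure names (level 2) out into the columns
print(df.stack().unstack(level=[0, 2]))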

How to use a list of categories that example belongs to as a feature solving classification problem?

One of the features looks like this:
1 170,169,205,174,173,246,247,249,380,377,383,38...
2 448,104,239,277,276,99,154,155,76,412,139,333,...
3 268,422,419,124,1,17,431,343,341,435,130,331,5...
4 50,53,449,106,279,420,161,74,123,364,231,18,23...
5 170,169,205,174,173,246,247,249,380,377,383,38...
It tells us which categories the example belongs to.
How should I use it when solving a classification problem?
I've tried to use dummy variables,
df = df.join(features['cat'].str.get_dummies(',').add_prefix('contains_'))
but we don't know whether there are other categories that were not mentioned in the training set, so I do not know how to preprocess all the objects.
That's interesting. I didn't know str.get_dummies, but maybe I can help you with the rest.
You basically have two problems:
The set of categories you get later may contain categories that were unknown while training the model. You have to get rid of these later.
The set of categories you get later may not contain all categories. You have to make sure you generate dummies for them as well.
Problem 1: filtering out unknown/unwanted categories
The first problem is easy to solve:
# create a set of all categories you want to allow;
# either define it as a fixed set, or extract it from your
# column like this (the output of the map itself is irrelevant;
# the result accumulates in valid_categories)
valid_categories = set()
df['categories'].str.split(',').map(valid_categories.update)

# now, if you want to normalize your data before the dummy
# encoding, you can cleanse it by splitting, intersecting with
# the valid set, and joining back again to get a string on
# which str.get_dummies can work
df['categories'].str.split(',').map(lambda l: valid_categories.intersection(l)).str.join(',')
Problem 2: generating dummies for all known categories
The second problem can be solved by adding a dummy row that contains all categories (e.g. with df.append or df.loc) just before you call get_dummies, and removing it right after get_dummies.
# e.g. you can do it like this: get a new index value so you can
# remove the row later (this only works with a numeric index)
dummy_index = df.index.max() + 1

# assign the categories to the dummy row
df.loc[dummy_index] = {'id': 999, 'categories': ','.join(valid_categories)}

# now do the processing steps mentioned in the section above,
# then create the dummies; after that, remove the dummy line again
df.drop(labels=[dummy_index], inplace=True)
Example:
import io
import pandas as pd

raw = """id categories
1 170,169,205,174,173,246,247
2 448,104,239,277,276,99,154
3 268,422,419,124,1,17,431,343
4 50,53,449,106,279,420,161,74
5 170,169,205,174,173,246,247"""
df = pd.read_fwf(io.StringIO(raw))
valid_categories = set()
df['categories'].str.split(',').map(valid_categories.update)
# remove 154 and 170 for demonstration purposes
valid_categories.remove('170')
valid_categories.remove('154')
df['categories'].str.split(',').map(lambda l: valid_categories.intersection(l)).str.join(',').str.get_dummies(',')
Out[622]:
1 104 106 124 161 169 17 173 174 205 239 246 247 268 276 277 279 343 419 420 422 431 448 449 50 53 74 99
0 0 0 0 0 0 1 0 1 1 1 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
1 0 1 0 0 0 0 0 0 0 0 1 0 0 0 1 1 0 0 0 0 0 0 1 0 0 0 0 1
2 1 0 0 1 0 0 1 0 0 0 0 0 0 1 0 0 0 1 1 0 1 1 0 0 0 0 0 0
3 0 0 1 0 1 0 0 0 0 0 0 0 0 0 0 0 1 0 0 1 0 0 0 1 1 1 1 0
4 0 0 0 0 0 1 0 1 1 1 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
You can see that there are no columns for 154 and 170.
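As an aside (an alternative sketch, not from the original answer): instead of the dummy-row trick, you can reindex the dummy columns against the full category set, which fills in columns for absent categories with 0:
dummies = (df['categories'].str.split(',')
           .map(lambda l: valid_categories.intersection(l))
           .str.join(',')
           .str.get_dummies(','))
# guarantee a column for every known category; absent ones become all-zero
dummies = dummies.reindex(columns=sorted(valid_categories), fill_value=0)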

How can I apply MinMaxScaler to my dataset, grouped by Num (ID) and for each column?

Num Year 1 2 3 4 5 6 7 8 9 10 11 12
111 2009 0 0 667 584 343 245 0 0 231 742 0 482
111 2010 0 10 624 4 143 445 20 991 0 742 0 482
112 2008 .....
The dataset looks like this; Num and Year form a MultiIndex in the pandas DataFrame.
The desired output is the data in columns 1 to 12 scaled within each Num group, so that I can compare the usage (for each month) across years only within the same Num.
print(df_pivot.groupby('Num')[1].apply(lambda x: MinMaxScaler().fit_transform(x.values.reshape(-1,1))))
print(df_pivot.groupby('Num')[1].apply(lambda x: MinMaxScaler().fit_transform(x.values.reshape(-1,1))).loc[11111])
The first output is:
11111 [[0.0], [0.0], [0.8012012012012012], [0.690210...
11112 [[0.5682020471740098], [0.5516243880729862], [...
11115 [[0.0], [0.0], [0.9999999999999999], [0.747932...
The second output is:
[[0. ]
[0.97836918]
[0.91226151]
[1. ]
[0.68939118]
[0.51707376]
[0.50893183]]
Looking at the first output, there is no Year index, only the Num index, so the result is returned as a pandas Series of arrays; this is not the desired output. Can somebody help me get the data like this?
11111 2010 0.0 0.3 0.4 1.0 0.5 ....
11111 2011 0.8 ....
11111 2012 0.2 ....
....
Instead of:
df_pivot.groupby('Num')[1]
you have to write:
df_pivot.groupby(['Num','Year'])
For example:
print(df_pivot.groupby(['Num','Year']).apply(lambda x: MinMaxScaler().fit_transform(x.values.reshape(-1,1))))
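Note that groupby(['Num','Year']).apply still returns one array per group rather than a DataFrame. If the goal is the tabular output shown in the question (the (Num, Year) index kept, each month column scaled within its Num group), a hedged sketch using groupby/transform (assuming df_pivot has the MultiIndex described above) could be:
def minmax(col):
    # scale one column to [0, 1] within a single Num group
    rng = col.max() - col.min()
    return (col - col.min()) / rng if rng else col * 0.0

scaled = df_pivot.groupby(level='Num').transform(minmax)
print(scaled)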