Match and add exactly line - line

Hi folks! I am new to programming and stuck on a problem. I have two files:
file1 contains:
s145 12 32 56
s430 48 56 20
s76 45 arg in
file2 contains only the name of s values:
s145 protos
s430 retus
s76 cosess
I want to append the name of each s value to the end of its matching line in file1, after the last column. Could someone show me how I can solve this? Thanks.
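One possible approach, sketched in Python: read file2 into a dictionary keyed on the s value, then look up the first field of every file1 line. A dictionary lookup matches the whole key exactly, so s76 would not be confused with a longer key such as s760. The function name append_names is made up for this illustration, and the sketch assumes both files are whitespace-separated with the s value in the first field.

```python
def append_names(data_lines, name_lines):
    """Append the name from name_lines to each data line with the same first field."""
    # Build an exact-match lookup table: s value -> name.
    names = {}
    for line in name_lines:
        parts = line.split(None, 1)
        if len(parts) == 2:
            names[parts[0]] = parts[1].strip()

    result = []
    for line in data_lines:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        name = names.get(fields[0])
        # Lines without a matching key are kept unchanged.
        result.append(" ".join(fields + ([name] if name else [])))
    return result


file1 = ["s145 12 32 56", "s430 48 56 20", "s76 45 arg in"]
file2 = ["s145 protos", "s430 retus", "s76 cosess"]
for merged in append_names(file1, file2):
    print(merged)
```

To run it on the real files, the two lists could be replaced with open("file1").read().splitlines() and open("file2").read().splitlines().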

Related

Removing the .0 from a pandas column

After a simple merge of two dataframes, the following ID column becomes an object column and a ".0" is added at the end of each value for no apparent reason. I tried replacing the NaN values with an integer and then converting the whole column to an integer dtype, hoping the .0 would go away. The code runs, but it doesn't change the dtype of that column. I also tried removing the .0 with rstrip, but that wipes out everything: even values like 249123.0 become NaN, which doesn't make sense. I know this is a very basic issue, but I am not sure what else to try at this point.
Input:
Age ID
22 23105.0
34 214541.0
51 0
8 62341.0
Desired output:
Age ID
22 23105
34 214541
51 0
8 62341
Any ideas would be much appreciated.
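One way to get rid of the trailing .0 in an object column is to use pandas.DataFrame.replace with a regular expression and then cast the result to an integer dtype (this needs import numpy as np, and the cast will fail if the column still contains NaN):
df['ID'] = df['ID'].replace(r'\.0$', '', regex=True).astype(np.int64)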
# Output :
print(df)
Age ID
0 22 23105
1 34 214541
2 51 0
3 8 62341
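An alternative that avoids the regex, shown here as a sketch rather than the answerer's method: parse the column with pd.to_numeric and cast to an integer dtype. The sample frame below assumes the IDs are stored as strings, which is a guess about the asker's data. If the column can contain missing values, the nullable "Int64" dtype keeps them as missing instead of raising.

```python
import pandas as pd

# Assumed sample data: IDs stored as strings after the merge.
df = pd.DataFrame({
    "Age": [22, 34, 51, 8],
    "ID": ["23105.0", "214541.0", "0", "62341.0"],
})

# Parse the strings as numbers, then cast the whole-number floats to integers.
df["ID"] = pd.to_numeric(df["ID"]).astype("int64")
print(df)

# With missing values, the nullable integer dtype "Int64" can hold them.
with_missing = pd.to_numeric(pd.Series(["23105.0", None])).astype("Int64")
print(with_missing)
```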

Pandas Python How to handle question mark that appeared in dataframe

I have question marks that appeared in my data frame right next to the numbers, and I don't know how to erase or replace them. I don't want to drop the affected rows, since that may lead to inaccurate results.
. Value
0 58
1 82
2 69
3 48
4 8
I agree with the comments above that you should look into how you imported the data. But here is the answer to your question of how to remove the non-numeric characters.
This keeps only the first run of digits in each value (expand=False makes str.extract return a Series):
df['Value'] = df['Value'].str.extract(r'(\d+)', expand=False)
Then, if you wish to change the datatype to int, you can use this:
df['Value'] = pd.to_numeric(df['Value'])
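Putting the two steps together, a minimal sketch might look like this. The question marks in the sample frame are made up for illustration, since the asker's exact values are not shown.

```python
import pandas as pd

# Hypothetical sample: stray "?" characters next to the numbers.
df = pd.DataFrame({"Value": ["58?", "82?", "69", "?48", "8?"]})

# Keep the first run of digits in each cell, then convert the strings to numbers.
digits = df["Value"].str.extract(r"(\d+)", expand=False)
df["Value"] = pd.to_numeric(digits)
print(df)
```

Cells with no digits at all would come out as NaN, so no rows are dropped.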

Creating new column based on condition and extracting respective value from other column. Pandas Dataframe

I am relatively new to this field and am working with a data set to find meaningful insights into customer behavior. My dataset looks like:
customerId week first_trip_week rides
0 156 44 36 2
1 164 44 38 6
2 224 42 36 5
3 224 43 36 4
4 224 44 36 5
What I want to do is create new columns week 44, week 43 and week 42, and fill them with the values from the "rides" column for the respective customer id. This is in the hope that I can eventually also make the customerId my index and get denominations for different weeks. Help would be greatly appreciated!
Thank you!!
If I'm understanding you correctly, you want to create new columns in the same dataframe for weeks 44, 43, and 42, with the correct value for each customerId and NaN for customers that have no rides in that week. If your original dataframe has all the user data, I would first filter for the rows that have the correct week number:
week42DF = dataset.loc[dataset['week']==42,['customerId','rides']].rename(columns={'rides':'week42Rides'})
This keeps only the customerId and rides columns, and renames rides to make things a little easier for us. Then left join the old dataframe and the new one on customerId:
dataset = pd.merge(dataset,week42DF,how='left',on='customerId')
The users that are missing from week42DF will have NaN in the week42Rides column of the merged dataset, which you can then replace with zeros using the .fillna(0) method. Do this for each week you require.
See Pandas' documentation on merge and the more general concatenate for more info.
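The filter, rename and merge steps can be wrapped in a loop so they are not repeated by hand for every week. This is a sketch that rebuilds the sample data from the question; it assumes each customer has at most one row per week, otherwise the left join would duplicate rows.

```python
import pandas as pd

dataset = pd.DataFrame({
    "customerId": [156, 164, 224, 224, 224],
    "week": [44, 44, 42, 43, 44],
    "first_trip_week": [36, 38, 36, 36, 36],
    "rides": [2, 6, 5, 4, 5],
})

for week in (42, 43, 44):
    col = f"week{week}Rides"
    # Rides for this week only, one row per customer (assumed).
    week_df = (
        dataset.loc[dataset["week"] == week, ["customerId", "rides"]]
        .rename(columns={"rides": col})
    )
    # Left join keeps every original row; customers without rides get NaN.
    dataset = dataset.merge(week_df, how="left", on="customerId")
    dataset[col] = dataset[col].fillna(0).astype(int)

print(dataset)
```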

Distance between two lines

I have a set of points for which I need to calculate the distance between lines,
especially for the x range 70:80. Is this possible with awk, or any other method?
sample data
70.9247 24
73.6148 24
70.9231 25
73.6144 25
70.9216 26
73.6141 26
70.9201 27
73.6138 27
70.9187 28
73.6136 28
A few points:
1) The data is sorted on y, so each value of y has 2 points.
2) I want the distance between the two x values for every y, i.e. x(new) = x(n+1) - x(n)
expected output:
2.6901 24
2.6913 25
...........
2.6949 28
thanks
What you are after is something like:
awk 'NR%2{t=$1;next}{print $1-t,$2}'
It works as follows:
If the record/line number NR is odd, store the value of the first field in t and skip to the next record/line.
Otherwise (on even lines), print the difference between the first field and t, followed by the second field.
A similar way of writing this is:
awk '{if(NR%2){t=$1}else{print $1-t,$2}}'
but this is less awk-ish!
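For comparison, the same pairing logic could be sketched in Python. The function name pairwise_x_distance is made up here, and the sketch assumes, as the question states, that lines come in consecutive pairs sharing the same y. The result is rounded to four decimals to hide floating-point noise.

```python
def pairwise_x_distance(lines):
    """Return (x2 - x1, y) for each consecutive pair of 'x y' lines."""
    rows = [line.split() for line in lines if line.strip()]
    result = []
    # rows[0::2] are the odd-numbered lines, rows[1::2] the even-numbered ones.
    for first, second in zip(rows[0::2], rows[1::2]):
        distance = round(float(second[0]) - float(first[0]), 4)
        result.append((distance, second[1]))
    return result


sample = [
    "70.9247 24",
    "73.6148 24",
    "70.9187 28",
    "73.6136 28",
]
for distance, y in pairwise_x_distance(sample):
    print(distance, y)
```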

Find average of numbers from a specific line

I have a text file with 2 columns of numbers.
10 2
20 3
30 4
40 5
50 6
60 7
70 8
80 9
90 10
100 11
110 12
120 13
130 14
I would like to find the average of the 2nd column data from the 6th line. That is ( (7+8+9+10+11+12+13+14)/8 = 10.5 )
I found the post "Scripts for computing the average of a list of numbers in a data file"
and used the following:
awk '{s+=$2}END{print "ave:",s/NR}' fileName
but that gives me the average of the entire second column.
Any hints?
This one-liner should do:
awk -v s=6 'NR<s{next} {c++; t+=$2} END{printf "%.2f (%d samples)\n", t/c, c}' file
This awk script has three pattern/action pairs. The first skips every line before line s (here, lines 1 to 5). The second executes on every remaining line (from line s onwards); it increments a counter and adds column 2 to a running total. The third runs after all data have been processed, and prints your results.
The script below should also do the job:
awk 'NR>=6{avg+=$2}END{printf "Average of field 2 starting from 6th line %.1f\n",avg/(NR-5)}' file
Output
Average of field 2 starting from 6th line 10.5
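The same calculation could be sketched in Python for readers who prefer it. The function name average_from_line and its parameters are made up for this illustration; line and column numbers are 1-based to match awk's NR and $2.

```python
def average_from_line(lines, start=6, column=2):
    """Average the given 1-based column over lines numbered start and later."""
    values = [
        float(line.split()[column - 1])
        for number, line in enumerate(lines, start=1)
        if number >= start and line.strip()
    ]
    if not values:
        raise ValueError("no data lines at or after the start line")
    return sum(values) / len(values)


# Rebuild the 13-line sample from the question: "10 2", "20 3", ..., "130 14".
lines = [f"{10 * i} {i + 1}" for i in range(1, 14)]
print(average_from_line(lines))
```

Counting the values actually read, rather than using NR-5, also gives the right answer if the start line changes.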