uniq does not clean out duplicate entries - osx-yosemite

I have a file with a number on each line. I am trying to find out the distinct set of numbers. Below is an excerpt:
550
400
4000
400
1900
550
5000
400
1500
1900
5000
4000
5000
1900
5000
Passing this through uniq, however, doesn't clean out all the duplicates. The number of lines is reduced to 256 from 699, but there are still multiple lines with 400 or 550, etc.
I generated this file with a Python script, so I know for a fact that each line consists of a blank followed by a number and then a \n. At least that's what I am printing within the code.
I do not understand what is wrong with the file. Why is uniq not working as I thought it would?
(OS X Yosemite, python 2.7)

You need to sort your contents before calling uniq, or you could just use sort -u. uniq only collapses adjacent repeated lines, so duplicates separated by other values survive in unsorted input.
From the uniq man page:
uniq - report or omit repeated lines
Note how it says repeated and not duplicate.
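For example (numbers.txt and distinct.txt are placeholder names for your input and output files):
sort numbers.txt | uniq > distinct.txt
# or, equivalently:
sort -u numbers.txt > distinct.txt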

GitLab API - get the overall # of lines of code

I'm able to get the stats (additions, deletions, total) for each commit; however, how can I get the overall number?
For example, if one MR has 30 commits, I need the net number of lines of code added/deleted, which you can see in the top corner.
This number IS NOT the sum of the per-commit numbers.
So I need an API that returns the net number of lines of code added/removed at the MR level (no matter how many commits there are).
For example, if I have 2 commits where the 1st one adds 10 lines and the 2nd one removes the exact same 10 lines, then the net number is 0.
Here is the scenario:
I have an MR with 30 commits.
The GitLab API provides support for getting the stats (lines of code added/deleted) per commit (individually).
If I go to the GitLab UI, under the MR's Changes tab, I see a number of lines added/deleted that is not the SUM of all the commit stats I'm getting through the API.
That's my issue.
A simpler example: let's say I have 2 commits; one adds 10 lines of code, while the 2nd removes the exact same 10 lines. Using the API, I'm getting the sum, which is 20 LOC changed. However, if I go to the GitLab UI's Changes tab, it shows me 0 (zero), which is correct; that's the net number of changes overall. This is the inconsistency I noticed.
To do this for an MR, you would use the MR changes API and count the occurrences of lines starting with + and - in the changes[].diff fields to get the additions and deletions respectively.
Using bash with gitlab-org/gitlab-runner!3195 as an example:
GITLAB_HOST="https://gitlab.com"
PROJECT_ID="250833"
MR_ID="3195"
URL="${GITLAB_HOST}/api/v4/projects/${PROJECT_ID}/merge_requests/${MR_ID}/changes"
# Pull the diff text of every changed file in the MR
DIFF=$(curl -s "${URL}" | jq -r ".changes[].diff")
# Added lines start with "+", removed lines start with "-"
ADDITIONS=$(grep -E "^\+" <<< "$DIFF")
DELETIONS=$(grep -E "^-" <<< "$DIFF")
NUM_ADDITIONS=$(wc -l <<< "$ADDITIONS")
NUM_DELETIONS=$(wc -l <<< "$DELETIONS")
echo "${MR_ID} has ${NUM_ADDITIONS} additions and ${NUM_DELETIONS} deletions"
The output is
3195 has 9 additions and 2 deletions
This matches the UI, which also shows 9 additions and 2 deletions.
This, as you can see, is a representative example of your described scenario, since the combined totals of the individual commits in this MR are 13 additions and 6 deletions.
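If you prefer to keep the counting inside jq rather than using grep and wc, a roughly equivalent sketch (reusing the URL variable from the script above; like the grep version, it simply counts diff lines that begin with + or -) would be:
# Number of added lines across all files in the MR
curl -s "${URL}" | jq '[.changes[].diff | split("\n")[] | select(startswith("+"))] | length'
# Number of removed lines
curl -s "${URL}" | jq '[.changes[].diff | split("\n")[] | select(startswith("-"))] | length'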

Appending the datetime to the end of every line in a 600 million row file

I have a 680 million row (19 GB) file that I need a datetime appended onto every line of. I get this file every night, and I have to add the time that I processed it to the end of each line. I have tried many ways to do this, including sed/awk and loading it into a SQL database with the last column defaulted to the current timestamp.
I was wondering if there is a fast way to do this? My fastest way so far takes two hours, and that is just not fast enough given the urgency of the information in this file. It is a flat CSV file.
edit1:
Here's what I've done so far:
awk -v date="$(date +"%Y-%m-%d %r")" '{ print $0","date}' lrn.ae.txt > testoutput.txt
Time = 117 minutes
perl -ne 'chomp; printf "%s.pdf\n", $_' EXPORT.txt > testoutput.txt
Time = 135 minutes
mysql load data local infile '/tmp/input.txt' into table testoutput
Time = 211 minutes
You don't specify if the timestamps have to be different for each of the lines. Would a "start of processing" time be enough?
If so, a simple solution is to use the paste command, with a pre-generated file of timestamps, exactly the same length as the file you're processing. Then just paste the whole thing together. Also, if the whole process is I/O bound, as others are speculating, then maybe running this on a box with an SSD drive would help speed up the process.
I just tried it locally on a 6 million row file (roughly 1% of yours), and it's actually able to do it in less than one second, on a MacBook Pro with an SSD drive.
~> date; time paste file1.txt timestamps.txt > final.txt; date
Mon Jun 5 10:57:49 MDT 2017
real 0m0.944s
user 0m0.680s
sys 0m0.222s
Mon Jun 5 10:57:49 MDT 2017
I'm going to now try a ~500 million row file, and see how that fares.
Updated:
Ok, the results are in. paste is blazing fast compared to your solution; it took just over 90 seconds in total to process the whole thing, 600M rows of simple data.
~> wc -l huge.txt
600000000 huge.txt
~> wc -l hugetimestamps.txt
600000000 hugetimestamps.txt
~> date; time paste huge.txt hugetimestamps.txt > final.txt; date
Mon Jun 5 11:09:11 MDT 2017
real 1m35.652s
user 1m8.352s
sys 0m22.643s
Mon Jun 5 11:10:47 MDT 2017
You still need to prepare the timestamps file ahead of time, but that's a trivial bash loop. I created mine in less than one minute.
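For example, one way to pre-generate such a timestamps file (a sketch only; the 600000000 count and the file name are just illustrative, not necessarily what was used here):
# Emit the same timestamp once per input row
awk -v ts="$(date +"%Y-%m-%d %r")" 'BEGIN { for (i = 0; i < 600000000; i++) print ts }' > hugetimestamps.txt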
A solution that simplifies mjuarez' helpful approach:
yes "$(date +"%Y-%m-%d %r")" | paste -d',' file - | head -n "$(wc -l < file)" > out-file
Note that, as with the approach in the linked answer, you must know the number of input lines in advance - here I'm using wc -l to count them, but if the number is fixed, simply use that fixed number.
yes keeps repeating its argument indefinitely, each on its own output line, until it is terminated.
paste -d',' file - pastes a corresponding pair of lines from file and stdin (-) on a single output line, separated with a ,.
Since yes produces "endless" output, head -n "$(wc -l < file)" ensures that processing stops once all input lines have been processed.
The use of a pipeline acts as a memory throttle, so running out of memory shouldn't be a concern.
Another alternative to test is
$ date +"%Y-%m-%d %r" > timestamp
$ join -t, -j9999 file timestamp | cut -d, -f2-
or the timestamp can be generated in place with process substitution: join -t, -j9999 file <(date +"%Y-%m-%d %r") | cut -d, -f2-
join creates a cross product of the first file and the second file using the non-existent field (9999), and since the second file is only one line, this effectively appends it to every line of the first file. The cut is needed to get rid of the empty key field that join puts at the start of each output line.
If you want to add the same (current) datetime to each row in the file, you might as well leave the file as it is, and put the datetime in the filename instead. Depending on the use later, the software that processes the file could then first get the datetime from the filename.
To put the same datetime at the end of each row, some simple code could be written:
Make a string containing a separator and the datetime.
Read the lines from the file, append the above string and write back to a new file.
This way a conversion from datetime to string is only done once, and converting the file should not take much longer than copying the file on disk.
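A minimal sketch of the rename idea (the file name from the question and the timestamp format are just illustrative):
# Stamp the processing time into the file name rather than into each row
mv lrn.ae.txt "lrn.ae.$(date +"%Y-%m-%d_%H-%M-%S").txt"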

awk match pattern from file

I have very large data sets in which I need to find specific patterns located in a specific column index and need the entire line output. I've gotten [successfully] as far as a single cmd line pattern match:
awk -F'|' -v OFS='|' '$1=="100002"{print $1,$22,$11,$12,$13,$28,$25,$27}' searchfile > outfile
100002 - being the search pattern, is an exact match and is located in column 1
searchfile - the data file, with 3.8 million lines and 60 columns, all | delimited
Now I want to modify this search by specifying an input patternfile, because I have a little over 800 patterns that need to be matched and output. I've done my best to search the site and did find the -f flag, but I don't know how to integrate that with my search criteria above. I need to be able to specify: exact match, a specific column index to search, specific columns to output, and the in/out delimiter.
sample data set (note this has been modified to protect data owner):
100001|0|60|100001|AAR Corp| | |Industrial|Aerospace/Defense|Aerospace/Defense-Equip|US|US|US|IL|DE|;2;6;1;1;1100 North Wood Dale Road;1; ;1;Wood Dale;1;IL;1;60191;1;United States;|
15460796|0|60|15460796|PayPal Data Services Inc|348546|eBay Inc|Consumer, Non-cyclical|Commercial Services|Inactive/Unknown|US|US|US|CA|DE|;2;6;1;1;2211 North 1st Street;1; ;1;San Jose;1;CA;1;95125;1;United States;|
100003|0|60|100003|Abex Inc|170435|Mafco Consolidated Group Inc|Industrial|Aerospace/Defense|Aerospace/Defense-Equip|US|US|US|NH|DE|;2;6;1;1;Liberty Lane;1; ;1;Hampton;1;NH;1;03842;1;United States;|
100004|0|60|100004|Abitibi-Consolidated Inc|23165941|Resolute Forest Products Inc|Basic Materials|Forest Products&Paper|Paper&Related Products|CA|CA|CA|QC|QC|;2;6;1;1;1155 Metcalfe Street;1;Suite 800;1;Montreal;1;QC;1;M5J 2P5;1;Canada;|
100005|0|60|100005|Acme Electric Corp|100763|Hubbell Inc|Industrial|Electrical Compo&Equip|Power Conv/Supply Equip|US|US|US|NC|NY|;2;6;1;1;400 Quaker Road;1; ;1;East Aurora;1;NY;1;14052;1;United States;|
100006|0|60|100006|ACME-Cleveland Corp|100430|Danaher Corp|Industrial|Hand/Machine Tools|Mach Tools&Rel Products|US|US|US|OH|OH|;2;6;1;1;30100 Chagrin Boulevard;1;Suite 100;1;Pepper Pike;1;OH;1;44124-5705;1;United States;|
100007|0|60|100007|Acuson Corp|196005|Siemens Corp|Consumer, Non-cyclical|Healthcare-Products|Ultra Sound Imaging Sys|US|US|US|CA|DE|;2;6;1;1;1220 Charleston Road;1; ;1;Mountain View;1;CA;1;94039;1;United States;|
100009|0|60|100009|ADT Ltd|101520|Tyco International Plc|Consumer, Non-cyclical|Commercial Services|Protection-Safety|BM|BM|BM| | |;2;6;1;1;Cedar House;1;41 Cedar Avenue;1;Hamilton;1; ;1;HM 12;1;Bermuda;|
100010|0|60|100010|Advanced Micro Devices Inc| | |Technology|Semiconductors|Electronic Compo-Semicon|US|US|US|CA|DE|;2;6;1;1;One AMD Place;1;PO Box 3453;1;Sunnyvale;1;CA;1;94088-3453;1;United States;|
input pattern search:
100006
100052
You can externalize all the variables from the script:
$ awk -v sep='|' -v matchindex='1' -v matchvalue='100002' -v columns='1,22,11,12,13,28,25,27' '
    BEGIN{FS=OFS=sep; n=split(columns,c,",")}
    $matchindex==matchvalue{for(i=1;i<n;i++) printf "%s", $c[i] OFS; printf "%s\n", $c[n]}' searchfile
and perhaps write another script to generate the first line from a config file.
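To plug in the ~800 patterns from a file rather than a single matchvalue, one common approach (a sketch, assuming the pattern file, here called patternfile, holds one ID per line) is to load the IDs into an array first and then test column 1 against it:
awk -F'|' -v OFS='|' 'NR==FNR { wanted[$1]; next }
    ($1 in wanted) { print $1,$22,$11,$12,$13,$28,$25,$27 }' patternfile searchfile > outfile
The NR==FNR block runs only while the first file (patternfile) is being read, so its IDs become keys of the wanted array; the second block then prints only those searchfile lines whose first column is one of those keys, with the same delimiter and column selection as your original command.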

How can I remove lines from a file with more than a certain number of entries

I've looked at the similar question about removing lines with more than a certain number of characters and my problem is similar but a bit trickier. I have a file that is generated after analyzing some data and each line is supposed to contain 29 numbers. For example:
53.0399 0.203827 7.28285 0.0139936 129.537 0.313907 11.3814 0.0137903 355.008 \
0.160464 12.2717 0.120802 55.7404 0.0875189 11.3311 0.0841887 536.66 0.256761 \
19.4495 0.197625 46.4401 2.38957 15.8914 17.1149 240.192 0.270649 19.348 0.230\
402 23001028 23800855
53.4843 0.198886 7.31329 0.0135975 129.215 0.335697 11.3673 0.014766 355.091 0\
.155786 11.9938 0.118147 55.567 0.368255 11.449 0.0842612 536.91 0.251735 18.9\
639 0.184361 47.2451 0.119655 18.6589 0.592563 240.477 0.298805 20.7409 0.2548\
56 23001585
50.7302 0.226066 7.12251 0.0158698 237.335 1.83226 15.4057 0.059467 -164.075 5\
.14639 146.619 1.37761 55.6474 0.289037 11.4864 0.0857042 536.34 0.252356 19.3\
91 0.198221 46.7011 0.139855 20.1464 0.668163 240.664 0.284125 20.3799 0.24696\
23002153
But every once in a while, a line like the first one appears that has an extra 8-digit number at the end, from analyzing an empty file (so it just returns the file ID number, but not on a new line like it should). So I just want to find the lines that have this extra 30th number and remove just that 30th entry. I figure I could do this with awk, but since I have little experience with it, I'm not sure how. So if anyone can help, I'd appreciate it.
Thanks
Summary: I want to find lines in a text file with an extra entry in a row and remove that last extra entry so all rows have the same number of entries.
With awk, you can tell it how many fields there are per record; the extra fields are dropped:
awk '{NF = 29; print}' filename
If you want to save that back to the file, you have to do a little extra work
awk '{NF = 29; print}' filename > filename.new && mv filename.new filename
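If you'd rather leave the 29-field lines completely untouched and only rewrite the over-long ones, a small variation of the same idea is:
awk 'NF > 29 { NF = 29 } { print }' filename > filename.new && mv filename.new filename
awk only rebuilds the record (using its output field separator) when NF is assigned, so lines that already have 29 fields pass through unchanged.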

ksh loop putting results of file into variable

I'm not sure of the best way to handle this; I'm guessing it's with a while loop. I have a .txt file with a set of numbers (these numbers can change based on another script that runs).
ex:
0
36
41
53
60
Each number is on its own line. For each number, I want to take that number and execute a script using it. So in this example I would call a script to stop database 0; after that completes, call a script to stop database 36, and so on until all the numbers in the list are done.
1) Is a while loop the best way to handle this?
2) I'm having trouble determining what the [[ condition ]] needs to be to get each number one at a time; where can I find some additional help on this?
while [[ condition ]] ; do
command1
done
For testing purposes the file that contains all the numbers is test.txt. The script that will execute is a python script - "amgr.py stop (number from test.txt)"
Here's a simpler way:
cat test.txt | xargs -n 1 amgr.py stop
this will take each line of your file and pass it as an extra parameter to amgr.py, running one invocation per number (that's what -n 1 does):
amgr.py stop 0
amgr.py stop 36
and so on..
I ended up using this method to get the results I was looking for:
while read -r line ; do
    amgr.py stop "$line"
done < test.txt