How to parse the XML output from Postgres as input for BaseX in Linux

How can I parse the XML output from Postgres as an input for Basex in Linux?

Oh, I see my answer is somewhat outdated; yet I'll leave it here, as in my opinion the approach you describe in your answer might be overkill for the task at hand.
I am not sure if you even have a question, yet I'd like to propose a fundamentally leaner approach ;-)
I hope it helps a little! Have fun!
For the current use case you may throw away awk, sed, postgres and wget; you can do all that you need in 25 lines of XQuery:
1) Some basics, fetch a file from a remote server:
fetch:text('https://www.wien.gv.at/statistik/ogd/vie_101.csv')
2) Skip the first line.
I decided to use the header that came with the original file, but you could just as well supply your own column names.
fetch:text('https://www.wien.gv.at/statistik/ogd/vie_101.csv')
=> tokenize(out:nl()) (: Split string by newline :)
=> tail() (: Skip first line :)
=> string-join(out:nl()) (: Join strings with newline :)
So in total your Requirements condense to:
RQ1:
(: Fetch CSV as Text, split it per line, skip the first line: :)
let $lines := fetch:text('https://www.wien.gv.at/statistik/ogd/vie_101.csv')
=> tokenize(out:nl()) (: Split string by newline :)
=> tail() (: Skip first line :)
=> string-join(out:nl()) (: Join strings with newline :)
(: Parse the csv file, first line contains element names.:)
let $csv := csv:parse($lines, map { "header": true(), "separator": ";"})
for $record in $csv/csv/record
group by $date := $record/REF_DATE
order by $date ascending
return element year_total {
attribute date { $date },
attribute population { sum($record/POP_TOTAL) => format-number("0000000")}
}
RQ2:
(: Fetch CSV as Text, split it per line, skip the first line: :)
let $lines := fetch:text('https://www.wien.gv.at/statistik/ogd/vie_101.csv')
=> tokenize(out:nl()) (: Split string by newline :)
=> tail() (: Skip first line :)
=> string-join(out:nl()) (: Join strings with newline :)
(: Parse the csv file, first line contains element names.:)
let $csv := csv:parse($lines, map { "header": true(), "separator": ";"})
for $record in $csv/csv/record
group by $date := $record/REF_DATE
order by $date ascending
return element year_total {
attribute date { $date },
attribute population { sum($record/POP_TOTAL) => format-number("0000000")},
for $sub_item in $record
group by $per-district := $sub_item/DISTRICT_CODE
return element district {
attribute name { $per-district },
attribute population { sum($sub_item/POP_TOTAL) => format-number("0000000")}
}
}
Including the file write and the date formatted in a more readable way:
(: wrap elements in single root element :)
let $result := element result {
(: Fetch CSV as Text, split it per line, skip the first line: :)
let $lines := fetch:text('https://www.wien.gv.at/statistik/ogd/vie_101.csv')
=> tokenize(out:nl()) (: Split string by newline :)
=> tail() (: Skip first line :)
=> string-join(out:nl()) (: Join strings with newline :)
(: Parse the csv file, first line contains element names.:)
let $csv := csv:parse($lines, map { "header": true(), "separator": ";"})
for $record in $csv/csv/record
group by $date := $record/REF_DATE
order by $date ascending
return element year_total {
attribute date { $date => replace("^(\d{4})(\d{2})(\d{2})","$3.$2.$1")},
attribute population { sum($record/POP_TOTAL) => format-number("0000000")},
for $sub_item in $record
group by $per-district := $sub_item/DISTRICT_CODE
return element district {
attribute name { $per-district },
attribute population { sum($sub_item/POP_TOTAL) => format-number("0000000")},
$sub_item
}
}
}
return file:write("result.xml", $result)

Setup
Data source : http://www.wien.gv.at/statistik/ogd/vie_101.csv
Research questions (RQ):
RQ1: How many people lived in Vienna in total per census?
RQ2: How many people lived in each Viennese district per census?
Preparation
In order to answer the RQs, the PostgreSQL DB was chosen. Adhering to the proverbial saying "Where there's a shell, there's a way", this code shows a neat solution for the BASH (CLI, Debian/Ubuntu flavored). Also, it is much easier to interact with PostgreSQL from the BASH when creating the files needed for further processing. Regarding the installation process, please consult:
https://tecadmin.net/install-postgresql-server-on-ubuntu/
First download the file with wget:
cd /path/to/directory/ ;
wget -O ./vie_101.csv http://www.wien.gv.at/statistik/ogd/vie_101.csv ;
Then look at the file with your favorite spreadsheet program (LibreOffice Calc).
vie_101.csv should be in UTF-8 encoding and probably uses a semicolon (;) delimiter. Open, check, change, save.
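If you prefer to check this from the command line instead (optional; file and head are standard tools), the encoding and delimiter can be inspected with:
file ./vie_101.csv ;
head -n 5 ./vie_101.csv ;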
Some reformatting is needed for ease of processing down the line. First, a header file is created with the appropriate column names. Second, the downloaded file is "beheaded" (the first 2 rows are removed) and "cut" (into the columns of interest). Finally, the result is appended to the header file.
echo 'DISTRICT,POPULATION,MALE,FEMALE,DATE' > ./vie.csv ;
declare=$(sed -e 's/,/ INT,/g' ./vie.csv)' INT' ;
sed 's/\;/\,/g' ./vie_101.csv | sed 's/\.//g' | tail -n+3 | cut -d ',' -f4,6-9 >> ./vie.csv ;
PostgreSQL
In order to load the data into PostgreSQL, a matching table needs to be created first:
echo "create table vie ( $declare );" | sudo -u postgres psql ;
In order to actually load the data into PostgreSQL, the previously created and formatted file (vie.csv) needs to be copied into a folder accessible to the postgres superuser. Only then can the copy command be executed to load the data. It needs to be noted that root privileges are required for this operation (sudo).
sudo cp ./vie.csv /var/lib/postgresql/ ;
echo "\copy vie from '/var/lib/postgresql/vie.csv' delimiter ',' csv header ;" | sudo -u postgres psql ;
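To verify that the rows actually arrived, a quick count can be run the same way (just a sanity check; the number should match the line count of vie.csv minus the header):
echo "select count(*) from vie ;" | sudo -u postgres psql -Aqt ;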
XML Schema
Before we create our XML document, we have to design the structure of our file. We decided to create an XML schema (schema.xsd) instead of a DTD.
Our schema defines a root element <table> and its child <row>, which are complex elements. The <row> element can occur any number of times. The children of <row> are <district>, <population>, <male>, <female> and <date>. These 5 sibling elements are simple elements and the defined value type is always integer.
Create XML with PostgreSQL
Since the ultimate goal is to answer the RQs via XQuery, an XML file is needed. This file (vie_data.xml) needs to be correctly formatted and well formed. As the next step, the query_to_xml command is piped to psql; the flags -Aqt are used to:
-A [unaligned output mode, removes column alignment and the '+' at end of line]
-q [quiet output]
-t [tuples only, removes header and footer]
echo "select query_to_xml( 'select * from vie order by date asc', true, false, 'vie' ) ;" | sudo -u postgres psql -Aqt > ./vie_data.xml ;
Now, it is important to export the schema of the table with table_to_xmlschema().
echo "select table_to_xmlschema( 'vie', true, false, '') ;" | sudo -u postgres psql -Aqt > ./vie_schema.xsd ;
This concludes all tasks within PostgreSQL and the BASH. As a last command, the BaseX GUI can be launched:
basexgui
XQuery
Using BaseX, the XML file can easily be validated against the schema via:
validate:xsd('vie_data.xml', 'vie_schema.xsd')
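If you prefer the command line over the GUI, the same expression can also be run by saving it to a file (the name validate.xq is just an example) and passing that file to the basex standalone client:
echo "validate:xsd('vie_data.xml', 'vie_schema.xsd')" > validate.xq ;
basex validate.xq ;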
The XML file can be imported by clicking:
Database -> New
General -> Browse, select the XML file
Parsing -> turn on "Enable Namespaces" if it is not enabled
OK
RQ1 can be answered by grouping the data by 'DATE' via a for loop. Results are saved via:
file:write( 'path/to/directory/file_name' ).
file:write( '/path/to/directory/population_year_total.xml',
for $row in //table/row
group by $date := $row/date
order by $date ascending
return <year_total date="{$date}"
population="{sum($row/population)}">
</year_total>)
RQ2 is answered by nesting two for loops. The outer loop groups by DATE and returns the POPULATION total for each DATE given. The inner loop groups by DISTRICT; hence, it returns a sub-sum of the POPULATION.
file:write( '/path/to/directory/district_year_subtotal.xml',
for $row in //table/row
group by $date:= $row/date
order by $date ascending
return <sub_sum date="{$date}"
population="{sum($row/population)}">{
for $sub_item in $row
group by $district := $sub_item/district
order by $district ascending
return <sub_item district="{$district}"
population="{sum($sub_item/population)}"/>
}</sub_sum>)
Done

Related

Importing csv data to SQL using PowerShell

Hi Glorious People of the Interwebz!
I come to you with a humble question (please go easy on me, I am fairly OK in PowerShell, but my SQL skills are minimal... :( )
So I have been tasked with writing a PowerShell script to import data (from a number of csv files to a database) and I made good progress, based on this (I heavily modified my version). All works dashingly, except one part: when I try to insert the values (I created a sort of "mapping file" to map the csv headers to the data), I can't seem to use the created string in the values part. So here is what I have:
This is my current code for powershell (ignore the comments)
This is a sample data csv
This is my mapping file
What I would want is to replace the
VALUES(
'$($CSVLine.Invoice_Status_Text)',
'$($CSVLine.Invoice_Status)',
'$($CSVLine.Dispute_Required_Text)',
'$($CSVLine.Dispute_Required)',
'$($CSVLine.Dispute_Resolved_Text)',
'$($CSVLine.Dispute_Resolved)',
'$($CSVLine.Sub_Account_Number)',
'$($CSVLine.QTY)',
'$($CSVLine.Date_of_Service)',
'$($CSVLine.Service)',
'$($CSVLine.Amount_of_Service)',
'$($CSVLine.Total)',
'$($CSVLine.Location)',
'$($CSVLine.Dispute_Reason_Text)',
'$($CSVLine.Dispute_Reason)',
'$($CSVLine.Numeric_counter)'
);"
part, for example with a string generated this way:
But when I replace the long - and honestly, boring to type - values with the $valueString, I get this type of error:
Incorrect syntax was encountered while parsing '$($'.
Not sure if it matters, but my PS version is 7.1
Any good people who can give a good suggestion on how to build the values from my text file...?
Ta,
F.
As commented, wrapping variables inside single-quotes takes the variable as written literally, so you do not get the value contained (7957), but a string like $($CSVLine.Numeric_counter) instead.
I don't do SQL a lot, but I think I would change the part where you construct the values to insert like this:
# demo, read the csv file in your example
$csv = Import-Csv D:\Test\test.csv -Delimiter ';'
# demo, these are the headers (or better yet, the Property Names to use from the objects in the CSV) as ARRAY
# (you use `$headers = Get-Content -Path 'C:\Temp\SQL\ImportingCSVsIntoSQLv1\config\headers.txt'`)
$headers = 'Invoice_Status_Text','Invoice_Status','Dispute_Required_Text','Dispute_Required',
'Dispute_Resolved_Text','Dispute_Resolved','Sub_Account_Number','QTY','Date_of_Service',
'Service','Amount_of_Service','Total','Location','Dispute_Reason_Text','Dispute_Reason','Numeric_counter'
# capture formatted blocks of values for each row in the CSV
$AllValueStrings = $csv | ForEach-Object {
# get a list of values using propertynames you have in the $headers
$values = foreach ($propertyName in $headers) {
$value = $_.$propertyName
# output the VALUE to be captured in $values
# for SQL, single-quote the string type values. Numeric values without quotes
if ($value -match '^[\d\.]+$') { $value }
else { "'{0}'" -f $value }
}
# output the values for this row in the CSV
$values -join ",`r`n"
}
# $AllValueStrings will now have as many formatted values to use
# in the SQL as there are records (rows) in the csv
$AllValueStrings
Using your examples, $AllValueStrings would yield
'Ready To Pay',
1,
'No',
2,
'',
'',
'',
'',
'',
'',
'',
'',
'',
'',
'',
7957

AWK: finding common elements across arbitrary number of columns (either single column files or column matrix)

Problem
I have several files, each with a single column, and I want to compare each of them to one another to find which elements are contained across all files. Alternatively - if it is easier - I could make a column matrix.
Question
How can I find the common elements across multiple columns?
Request
I am not an expert at awk (obviously). So a verbose explanation of the code would be much appreciated.
Other
# joepvd made some code that was somewhat similar... https://unix.stackexchange.com/questions/216511/comparing-the-first-column-of-two-files-and-printing-the-entire-row-of-the-secon/216515#216515?newreg=f4fd3a8743aa4210863f2ef527d0838b
to find what elements are contained across all files
awk is your friend, as you guessed. Use the procedure below:
#Store the files in an array. Assuming all files are in one place
filelist=( $(find . -maxdepth 1 -type f) ) #array of files
awk -v count="${#filelist[@]}" '{value[$1]++} END{for(i in value){if(value[i]==count){printf "Value %s is found in all files\n",i}}}' "${filelist[@]}"
Note
We used -v count="${#filelist[@]}" to pass the total file count to awk. Note: # at the beginning of an array name gives the element count.
value[$1]++ increments the count of a value as seen in the file. It also creates value[$1] with an initial value of zero if it does not already exist.
This method fails if a value appears in a file more than once.
The END block in awk is executed only at the very end, i.e. after every record from all the files has been processed.
If you can have the same value multiple times in a single file, we'll need to take care to only count it once for each file.
A couple of variations with GNU awk (which is needed for ARGIND to be available. It could be emulated by checking FILENAME but that's even uglier.)
gawk '{ A[$0] = or(A[$0], lshift(1, ARGIND-1)) }
END { for (x in A) if (A[x] == lshift(1, ARGIND) - 1) print x }'
file1 file2 file3
The array A is keyed by the values (lines), and holds a bitmap of the files in which a line has been found. For each line read, we set bit number ARGIND-1 (since ARGIND starts with one).
At the end of input, run through all saved lines, and print them if the bitmap is all ones (up to the number of files seen).
gawk 'ARGIND > LASTIND {
LASTIND = ARGIND; for (x in CURR) { ALL[x] += 1; delete CURR[x] }
}
{ CURR[$0] = 1 }
END { for (x in CURR) ALL[x] += 1;
for (x in ALL) if (ALL[x] == ARGIND) print x
}' file1 file2 file3
Here, when a line is encountered, the corresponding element in array CURR is set (middle part). When the file number changes (ARGIND > LASTIND), values in array ALL are increased for all values set in CURR, and the latter is cleared. At the END of input, the values in ALL are updated for the last file, and the total count is checked against the total number of files, printing the ones that appear in all files.
The bitmap approach is likely slightly faster with large inputs, since it doesn't involve creating and walking through a temporary array, but the number of files it can handle is limited by the number of bits the bit operations can handle (which seems to be about 50 on 64-bit Linux).
In both cases, the resulting printout will be in essentially a random order, since associative arrays do not preserve ordering.
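If a deterministic order is needed, a simple option (my addition, not part of the original scripts) is to pipe the output through sort:
gawk '...same script as above...' file1 file2 file3 | sort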
I'm going to assume that it's the problem that matters, not the implementation language so here's an alternative using perl:
#! /usr/bin/perl
use strict;
my %elements=();
my $filecount=@ARGV;
while(<>) {
$elements{$_}->{$ARGV}++;
};
print grep {!/^$/} map {
"$_" if (keys %{ $elements{$_} } == $filecount)
} (keys %elements);
The while loop builds a hash-of-hashes (aka "HoH". See man perldsc and man perllol for details. Also see below for an example), with the top level key being each line from each input file, and the second-level key being the names of the file(s) that value appeared in.
The grep ... map {...} returns each top-level key where the number of files it appears in is equal to the number of input files
Here's what the data structure looks like, using the example you gave to ilkkachu:
{
'A' => { 'file1' => 1 },
'B' => { 'file2' => 1 },
'C' => { 'file1' => 1, 'file2' => 1, 'file3' => 1 },
'E' => { 'file2' => 1 },
'F' => { 'file1' => 1 },
'K' => { 'file3' => 1 },
'L' => { 'file3' => 1 }
}
Note that if there happen to be any duplicates in a single file, that fact is stored in this structure and can be checked.
The grep before the map isn't strictly required in this particular example, but is useful if you want to store the result in an array for further processing rather than print it immediately.
With the grep, it returns an array of only the matching elements, or in this case just the single value C. Without it, it returns an array of empty strings plus the matching elements. e.g. ("", "", "", "", "C", "", ""). Actually, they return the elements with a newline (\n) at the end because I didn't use chomp in the while loop as I knew I'd be printing them directly. In most programs, I'd use chomp to strip newlines and/or carriage-returns.

How to match similar filenames and rename so that diff tools like Beyond Compare see them as a pair to perform a binary comparison?

I'm looking for the best approach to comparing files that I believe are identical but which have different filenames. Comparison tools like BeyondCompare are great but they don't yet handle different filenames - when comparing files in separate folders they attempt comparisons with the files that have the same name on either side.
(I don't work for or have a financial interest in BeyondCompare, but I use the tool a lot and find it has some great features).
There is MindGems Fast Duplicate File Finder for matching files in any location throughout several folder trees that have different names, but this is based on CRC checks I believe. I am using this tool, but I only trust it gradually; so far no faults, but I don't trust it as much as BeyondCompare yet. BeyondCompare offers the complete peace of mind of doing a full binary compare on the file.
In my case the files tend to have similar names, the difference being ordering of the words, punctuation, case differences and not all words present. So it's not easy to use a regex filter to match the files that some diff tools like Beyond Compare already provide because the file substrings can be out of order.
I'm looking for a way to match similar filenames before renaming the files to be the same and then 'feeding' them to a tool like BeyondCompare. Solutions could be scripts or perhaps in the form of an application.
At the moment I have an idea for an algorithm (to implement in Perl) to match the filenames to suit my problem whereby the filenames are similar as described above.
Can you suggest something better or a completely different approach?
Find a list of files with the exact same filesize
Make a hash of alphanumeric substrings from first file, using non-alphanumeric characters or space as delimiter
Make a hash of alphanumeric substrings from second file, using non-alphanumeric characters or space as delimiter
Match occurrences
Find which file has the highest number of substrings.
Calculate a percentage score for the comparison on the pair based on number of matches divided by the highest number of substrings.
Repeat comparison for each file with every other file with the exact same file size
Sort the pair comparisons by percentage score to get suggestions of files to compare.
Rename one file in the pair so that it is the same as the other. Place in separate folders.
Run a comparison tool like BeyondCompare with the files, folder comparison mode.
As I already have Fast Duplicate File Finder Pro, this outputs a text report of the duplicates in CSV and XML format.
I will process the CSV to see the groupings and rename the files so that I can get beyond compare to do a full binary comparison on them.
Update:
And here is my code. This Perl script will look at each pair of files (in the directories/folders being compared) that are the same and rename one of them to be the same as the other so that the two folders can be run through Beyond Compare which will do a full binary compare (if the flatten folders option is switched on). Binary compare confirms the match so that means that one of each duplicate pair can be purged.
#!/usr/bin/perl -w
use strict;
use warnings;
use File::Basename;
my $fdffCsv = undef;
# fixed
# put matching string - i.e. some or all of path of file to keep here e.g. C:\\files\\keep\\ or just keep
my $subpathOfFileToKeep = "keep";
# e.g. jpg mp3 pdf etc.
my $fileExtToCompare = "jpg";
# changes
my $currentGroup = undef;
my $group = undef;
my $filenameToKeep = "";
my $path = undef;
my $name = undef;
my $extension = undef;
my $filename = undef;
open ( $fdffCsv, '<', "fast_duplicate_filefinder_export_as_csv.csv" );
my @filesToRenameArray = ();
while ( <$fdffCsv> )
{
my $line = $_;
my #lineColumns = split( /,/, $line );
# the first column is an index value
if ( $lineColumns[0] =~ m/\d+/ )
{
$group = $lineColumns[0];
( $line ) =~ /("[^"]+")/;
$filename = $1;
$filename =~ s/\"//g;
if ( defined $currentGroup )
{
if ( $group == $currentGroup )
{
( $name, $path, $extension ) = fileparse ( $filename, '\..*"' );
store_keep_and_rename();
}
else # group changed
{
match_the_filenames();
( $name, $path, $extension ) = fileparse ( $filename, '\..*"' );
store_keep_and_rename();
}
}
else # first time - beginning of file
{
$currentGroup = $group;
( $name, $path, $extension ) = fileparse ( $filename, '\..*"' );
store_keep_and_rename();
}
}
}
close( $fdffCsv );
match_the_filenames();
sub store_keep_and_rename
{
if ( $path =~ /($subpathOfFileToKeep)/ )
{
$filenameToKeep = $name.$extension;
}
else
{
push( @filesToRenameArray, $filename );
}
}
sub match_the_filenames
{
my $sizeOfFilesToRenameArraySize = scalar( @filesToRenameArray );
if ( $sizeOfFilesToRenameArraySize > 0 )
{
for (my $index = 0; $index < $sizeOfFilesToRenameArraySize; $index++ )
{
my $PreRename = $filesToRenameArray[$index];
my ( $preName, $prePath, $preExtension ) = fileparse ( $PreRename, '\..*' );
my $filenameToChange = $preName.$preExtension;
my $PostRename = $prePath.$filenameToKeep;
print STDOUT "Filename was: ".$PreRename."\n";
print STDOUT "Filename will be: ".$PostRename."\n\n";
rename $PreRename, $PostRename;
}
}
undef( @filesToRenameArray ); @filesToRenameArray = ();
$currentGroup = $group;
}
Beyond Compare can do that.
Just select the file on the left and the file to compare on the right.
Choose 'compare' or use the align function (right mouse button)

Looping SQL query in Bash script

I am new to bash scripting and I was wondering if anyone could help me with the following.
I am trying to retrieve the competition name from an Oracle database using competition_id with the following statement:
select name, competition_type from competitions where competition_id=' ';
However, I want to use a separate text file which has a list of competition_ids I want to identify. I want my script to find the name and type of all my ids and output the results to a txt file. This is what I have so far:
#!/bin/bash
echo Start Executing SQL commands
cat comps_ids.txt | while read ID
var=$ID
do
sqlplus "details"
<< EOF
select name, competition_type
from competitions
where competition_id=$var;
exit;
EOF
I tried to add a done at the end but I get an "unexpected line ending" error message. Can anyone solve this?
Many thanks in advance :)
I'm not sure what your command line should look like, but it's more like
sqlplus "details" <<EOF
select name, competition_type from competitions where competition_id=$var;
exit;
EOF
If your list of IDs isn't too big, it may be a better idea to make a comma-separated list and a single query.
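For example, a minimal sketch (assuming comps_ids.txt holds one numeric ID per line, and "details" stands for your connect string as in the question):
ids=$(paste -s -d, comps_ids.txt)
sqlplus "details" <<EOF
select name, competition_type from competitions where competition_id in ($ids);
exit;
EOF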
#!/bin/bash
function get_comp () {
sqlplus -S user/pass@database << EOF
set pagesize 0
set feedback off
set head off
select name, competition_type
from competitions
where competition_id=$1;
EOF
}
while read -r id ; do
get_comp $id
done
Put it in a file (get_comps.sh), and then call it like this
$ ./get_comps.sh < comp_ids.txt > text_file_out.txt
-S makes sqlplus quieter.
The other settings make it return just your data, not row headers or anything else.
Of course the database credentials are stored in the script in plain text, and are visible to other users using 'ps' or 'top'.
This is also horribly inefficient because it connects to the database for each row in your original file. If you have a lot of rows, you might try using python or ruby as their database stuff is pretty easy to use.
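If you want to stay in the shell but avoid reconnecting per ID, one option (a sketch reusing the placeholder credentials from above) is to generate all the statements first and feed them to a single sqlplus session:
{
echo "set pagesize 0"
echo "set feedback off"
echo "set head off"
while read -r id ; do
echo "select name, competition_type from competitions where competition_id=$id;"
done < comp_ids.txt
echo "exit;"
} | sqlplus -S user/pass@database > text_file_out.txt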

Bash $PATH is caching every modification

How can I clear the cache of $PATH in BASH? Every time I modify $PATH, the former modifications are preserved too! So my $PATH is already a page long :-), and it bothers me while working, because it points to some wrong places (because every modification is being appended at the end of the $PATH variable). Please help me solve this problem.
because every modification is being appended at the end of the $PATH variable
Take a close look at where you are setting $PATH, I bet it looks something like this:
PATH="$PATH:/some/new/dir:/another/newdir:"
Having $PATH in the new assignment gives you the appending behavior you don't want.
Instead do this:
PATH="/some/new/dir:/another/newdir:"
Update
If you want to strip $PATH of all duplicate entries but still maintain the original order then you can do this:
PATH=$(awk 'BEGIN{ORS=":";RS="[:\n]"}!a[$0]++' <<<"${PATH%:}")
PATH=$(echo $PATH | tr ':' '\n' | sort | uniq | tr '\n' ':')
Once in a while execute the above command. It will tidy up your PATH variable by removing any duplication.
-Cheers
PS: Warning: This will reorder the paths in the PATH variable, and can have undesired effects!
When I'm setting my PATH, I usually use this script - which I last modified in 1999, it seems (but use daily on all my Unix-based computers). It allows me to add to my PATH (or LD_LIBRARY_PATH, or CDPATH, or any other path-like variable) and eliminate duplicates, and trim out now unwanted values.
Usage
export PATH=$(clnpath /important/bin:$PATH:/new/bin /old/bin:/debris/bin)
The first argument is the new path, built by any technique you like. The second argument is a list of names to remove from the path (if they appear - no error if they don't). For example, I have up to about five versions of the software I work on installed at any given time. To switch between versions, I use this script to adjust both PATH and LD_LIBRARY_PATH to pick up the correct values for the version I'm about to start using, and remove the values of the version I'm no longer using.
Code
: "@(#)$Id: clnpath.sh,v 1.6 1999/06/08 23:34:07 jleffler Exp $"
#
# Print minimal version of $PATH, possibly removing some items
case $# in
0) chop=""; path=${PATH:?};;
1) chop=""; path=$1;;
2) chop=$2; path=$1;;
*) echo "Usage: `basename $0 .sh` [$PATH [remove:list]]" >&2
exit 1;;
esac
# Beware of the quotes in the assignment to chop!
echo "$path" |
${AWK:-awk} -F: '#
BEGIN { # Sort out which path components to omit
chop="'"$chop"'";
if (chop != "") nr = split(chop, remove); else nr = 0;
for (i = 1; i <= nr; i++)
omit[remove[i]] = 1;
}
{
for (i = 1; i <= NF; i++)
{
x=$i;
if (x == "") x = ".";
if (omit[x] == 0 && path[x]++ == 0)
{
output = output pad x;
pad = ":";
}
}
print output;
}'
Commentary
The ':' is an ancient way of using /bin/sh (originally the Bourne shell - now as often Bash) to run the script. If I updated it, the first line would become a shebang. I'd also not use tabs in the code. And there are ways to get the 'chop' value set that do not involve as many quotes:
awk -F: '...script...' chop="$chop"
But it isn't broken, so I haven't fixed it.
When adding entries to PATH, you should check to see if they're already there. Here's what I use in my .bashrc:
pathadd() {
if [ -d "$1" ] && [[ ":$PATH:" != *":$1:"* ]]; then
PATH="$PATH:$1"
fi
}
pathadd /usr/local/bin
pathadd /usr/local/sbin
pathadd ~/bin
This only adds directories to PATH if they exist (i.e. no bogus entries) and aren't already there. Note: the pattern matching feature I use to see if the entry is already in PATH is only available in bash, not the original Bourne shell; if you want to use this with /bin/sh, that part'd need to be rewritten.
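For reference, a /bin/sh-compatible variant could replace the bash-only pattern match with a case statement (a sketch of the same idea):
pathadd() {
  # append $1 to PATH only if it is a directory and not already present
  if [ -d "$1" ]; then
    case ":$PATH:" in
      *":$1:"*) ;;   # already in PATH, do nothing
      *) PATH="$PATH:$1" ;;
    esac
  fi
}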
I have a nice set of scripts that add path variables to the beginning or end of PATH depending on the ordering I want. The problem is that OSX puts /usr/local/bin after /usr/bin, which is exactly NOT what I want (being a brew user and all). So what I do is put a new copy of /usr/local/bin in front of everything else and use the following to remove all duplicates (leaving the ordering in place).
MYPATH=$(echo $MYPATH|perl -F: -lape'$_=join":",grep!$s{$_}++,@F')
I found this on perlmonks. Like most perl, it looks like line noise to me so I have no idea how it works, but work it does!