awk and csv columns - awk

I have a csv file that looks like this:
Raw CSV (;-separated) to be parsed:
number;abc;gender;middel;name;last name;existfrom;existuntil;ahv;specialnum;UID;pert;address;street;numberstr;po box;place;cntr;cntrcd;from;until;date;extra
1;100000388;lady;W;jane;doe;16.08.1930;31.12.9999;7.56777E+12;500000387;;N;WOA;angel;47;;6020;moon;NL;16.01.2016;31.12.9999;31.12.9999;
2;100000453;mister;M;jon;doe;29.12.1934;31.12.9999;7.56663E+12;500000452;;N;WOA;angel;11;;6314;moon;GR;16.01.2016;31.12.9999;22.09.2008;mutation
3;100000469;lady;W;jane;doe;16.02.1941;31.12.9999;7.56486E+12;500000468;;N;WOA;angel;11;;6314;mooon;DE;16.01.2016;31.12.9999;22.09.2008;mutation
4;100000490;mister;M;jon;doe;09.05.1936;31.12.9999;7.56841E+12;500000497;;N;WOA;silicon;65;;6340;moon;CZ;16.01.2016;31.12.9999;15.12.2010;Mutation
Formatted table for ease of reading:
+--------+-----------+--------+--------+------+-----------+------------+------------+-------------+------------+-----+------+---------+---------+-----------+--------+-------+-------+--------+------------+------------+------------+-----------+
| number | abc | gender | middel | name | last name | existfrom | existuntil | ahv | specialnum | UID | pert | address | street | numberstr | po box | place | cntr | cntrcd | from | until | date | extra |
+--------+-----------+--------+--------+------+-----------+------------+------------+-------------+------------+-----+------+---------+---------+-----------+--------+-------+-------+--------+------------+------------+------------+-----------+
| 1 | 100000388 | lady | W | jane | doe | 16.08.1930 | 31.12.9999 | 7.56777E+12 | 500000387 | | N | WOA | angel | 47 | | 6020 | moon | NL | 16.01.2016 | 31.12.9999 | 31.12.9999 | |
| 2 | 100000453 | mister | M | jon | doe | 29.12.1934 | 31.12.9999 | 7.56663E+12 | 500000452 | | N | WOA | angel | 11 | | 6314 | moon | GR | 16.01.2016 | 31.12.9999 | 22.09.2008 | mutation |
| 3 | 100000469 | lady | W | jane | doe | 16.02.1941 | 31.12.9999 | 7.56486E+12 | 500000468 | | N | WOA | angel | 11 | | 6314 | mooon | DE | 16.01.2016 | 31.12.9999 | 22.09.2008 | mutation |
| 4 | 100000490 | mister | M | jon | doe | 09.05.1936 | 31.12.9999 | 7.56841E+12 | 500000497 | | N | WOA | silicon | 65 | | 6340 | moon | CZ | 16.01.2016 | 31.12.9999 | 15.12.2010 | Mutation |
+--------+-----------+--------+--------+------+-----------+------------+------------+-------------+------------+-----+------+---------+---------+-----------+--------+-------+-------+--------+------------+------------+------------+-----------+
Among other tasks, I need to take the last 3 digits from column I and place them into column O, but so far without success.
For example, column I holds the number 7567766885966; I need to take the 966 and put it into column O, replacing the number that is currently there.
So far I have managed to do this:
#!/bin/sh
oldCsvFile=old.csv
newCsvFile=new.csv
# insert first line as-is into the new file
head -1 "$oldCsvFile" > "$newCsvFile"
# replace columns 7 and 13 with overwriting values
# (FS/OFS must be set in a BEGIN block, otherwise the first record is split on whitespace)
awk 'BEGIN { FS = OFS = ";" } { $7 = "\"01.08.2019\""; $13 = "\"RandomStreet\""; print }' "$oldCsvFile" >> "$newCsvFile"
# delete the second and last lines, which are the original's first and last lines with overwritten columns
sed -i '2 d' "$newCsvFile"
sed -i '$ d' "$newCsvFile"
# append the unmodified last line to the new file
tail -1 "$oldCsvFile" >> "$newCsvFile"
Can you please help me out?
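Not a full answer, but the intended transformation can be sanity-checked outside awk. A minimal Python sketch, assuming ';'-separated lines and that column I (the 9th field) holds the full digit string rather than the 7.56777E+12 form Excel displays:

```python
# Move the last 3 digits of column I (9th field) into column O (15th field),
# leaving every other field untouched.
def move_last3(line, sep=";"):
    fields = line.rstrip("\n").split(sep)
    # NB: the sample file shows 7.56777E+12 (Excel notation); we assume
    # the underlying value is the full digit string, e.g. 7567766885966.
    fields[14] = fields[8][-3:]  # column O := last 3 chars of column I
    return sep.join(fields)

row = "1;100000388;lady;W;jane;doe;16.08.1930;31.12.9999;7567766885966;500000387;;N;WOA;angel;47;;6020;moon;NL;16.01.2016;31.12.9999;31.12.9999;"
print(move_last3(row))  # field 15 ('47') becomes '966'
```

The equivalent awk would be roughly `awk 'BEGIN{FS=OFS=";"} {$15=substr($9,length($9)-2)} 1' old.csv` — untested against the real file, so treat it as a starting point.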

Related

Categorical GROUP BY with Monthly Totals

I have a table like the one below with the employee name, Position, Hire Date, and Exit Date.
+----------+----------------+-----------+-----------+
| Employee | Position | Hire Date | Exit Date |
+----------+----------------+-----------+-----------+
| John | Manager | 1-Feb-18 | |
| Mike | Senior Manager | 12-Oct-18 | 20-Jul-22 |
| Jennifer | Manager | 3-Apr-19 | |
| Cindy | CSR | 25-Nov-19 | 30-Mar-22 |
| Tom | CSR | 18-Jul-22 | |
| Rodrigo | Director | 19-Oct-21 | |
| Ashley | CSR | 17-Jan-22 | |
+----------+----------------+-----------+-----------+
I am looking to get the monthly headcount totals for each position starting January 2020. If Exit Date is blank it means that the employee is still currently working at the company. Below is an example of the desired output I am looking for:
+----------------+--------+--------+--------+--------+----+--------+
| Position | Jan-20 | Feb-20 | Mar-20 | Apr-20 | …. | Sep-22 |
+----------------+--------+--------+--------+--------+----+--------+
| Manager | 5 | 6 | 7 | 4 | | 9 |
| Senior Manager | 2 | 1 | 4 | 4 | | 4 |
| Director | 1 | 1 | 2 | 3 | | 5 |
| CSR | 10 | 14 | 15 | 15 | | 18 |
+----------------+--------+--------+--------+--------+----+--------+
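There is no query attempt to correct here, but the counting rule the question implies (an employee counts toward a month if hired on or before the month's end and, when Exit Date is present, not exited before the month's start) can be sketched in Python with the sample rows; the function names are mine:

```python
from datetime import date, timedelta

# (name, position, hire date, exit date) -- None means still employed
staff = [
    ("John",     "Manager",        date(2018, 2, 1),   None),
    ("Mike",     "Senior Manager", date(2018, 10, 12), date(2022, 7, 20)),
    ("Jennifer", "Manager",        date(2019, 4, 3),   None),
    ("Cindy",    "CSR",            date(2019, 11, 25), date(2022, 3, 30)),
    ("Tom",      "CSR",            date(2022, 7, 18),  None),
    ("Rodrigo",  "Director",       date(2021, 10, 19), None),
    ("Ashley",   "CSR",            date(2022, 1, 17),  None),
]

def end_of_month(year, month):
    # first day of the next month, minus one day
    if month == 12:
        return date(year, 12, 31)
    return date(year, month + 1, 1) - timedelta(days=1)

def headcount(position, year, month):
    # counted if hired by month end and not exited before month start
    start = date(year, month, 1)
    end = end_of_month(year, month)
    return sum(1 for _, pos, hired, left in staff
               if pos == position and hired <= end
               and (left is None or left >= start))

print(headcount("Manager", 2020, 1))  # 2 (John, Jennifer)
print(headcount("CSR", 2022, 7))      # 2 (Tom, Ashley; Cindy exited in March)
```

Pivoting these counts into the Position-by-month grid is then just a double loop over positions and months.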

Compare data between 2 different sources

I have two datasets coming from 2 sources and I have to compare them and find the mismatches: one from Excel and the other from a data warehouse.
From excel Source_Excel
+-----+-------+------------+----------+
| id | name | City_Scope | flag |
+-----+-------+------------+----------+
| 101 | Plate | NY|TN | Ready |
| 102 | Nut | NY|TN | Sold |
| 103 | Ring | TN|MC | Planning |
| 104 | Glass | NY|TN|MC | Ready |
| 105 | Bolt | MC | Expired |
+-----+-------+------------+----------+
From DW Source_DW
+-----+-------+------+----------+
| id | name | City | flag |
+-----+-------+------+----------+
| 101 | Plate | NY | Ready |
| 101 | Plate | TN | Ready |
| 102 | Nut | TN | Expired |
| 103 | Ring | MC | Planning |
| 104 | Glass | MC | Ready |
| 104 | Glass | NY | Ready |
| 105 | Bolt | MC | Expired |
+-----+-------+------+----------+
Unfortunately the Excel data comes with a delimited value in one column, so I have to use the DelimitedSplit8K function to split it into individual rows. This is the output I got after splitting the Excel source data:
+-----+-------+------+----------+
| id | name | item | flag |
+-----+-------+------+----------+
| 101 | Plate | NY | Ready |
| 101 | Plate | TN | Ready |
| 102 | Nut | NY | Sold |
| 102 | Nut | TN | Sold |
| 103 | Ring | TN | Planning |
| 103 | Ring | MC | Planning |
| 104 | Glass | NY | Ready |
| 104 | Glass | TN | Ready |
| 104 | Glass | MC | Ready |
| 105 | Bolt | MC | Expired |
+-----+-------+------+----------+
Now my expected output is something like this.
+-----+----------+---------------+--------------+
| ID | Result | Flag_mismatch | City_Missing |
+-----+----------+---------------+--------------+
| 101 | No_Error | | |
| 102 | Error | Yes | Yes |
| 103 | Error | No | Yes |
| 104 | Error | Yes | No |
| 105 | No_Error | | |
+-----+----------+---------------+--------------+
Logic:
I have to find whether there are any mismatches in the flag values.
If any city is missing after splitting, that should be reported.
Assume that there won't be any name or city mismatches.
As an initial step, I'm trying to get the mismatched rows, and I have tried the query below. It is not giving me any output. Please suggest where I am going wrong. Check Fiddle Here
select a.id, a.name, split.item, a.flag
from source_excel a
cross apply dbo.DelimitedSplit8k(a.city_scope, '|') split
where not exists (
    select a.id, split.item
    from source_excel a
    join source_dw b
        on a.id = b.id and a.name = b.name and a.flag = b.flag and split.item = b.city
)
Update
I have tried and got close to the answer with the help of temporary tables (Updated Fiddle), but I am not sure how to do it without temp tables.
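As a cross-check of the intended logic (not of the T-SQL itself), here is a Python sketch of the per-id comparison, using the split Excel rows and the DW rows from the question. Note that under a literal reading, id 104 has identical flags on both sides and TN missing from the DW, so this sketch reports Flag_mismatch = No and City_Missing = Yes — the reverse of the sample output's 104 row, which may simply be transposed in the question:

```python
excel = [  # (id, name, city, flag) after DelimitedSplit8K
    (101, "Plate", "NY", "Ready"), (101, "Plate", "TN", "Ready"),
    (102, "Nut", "NY", "Sold"), (102, "Nut", "TN", "Sold"),
    (103, "Ring", "TN", "Planning"), (103, "Ring", "MC", "Planning"),
    (104, "Glass", "NY", "Ready"), (104, "Glass", "TN", "Ready"),
    (104, "Glass", "MC", "Ready"),
    (105, "Bolt", "MC", "Expired"),
]
dw = [
    (101, "Plate", "NY", "Ready"), (101, "Plate", "TN", "Ready"),
    (102, "Nut", "TN", "Expired"),
    (103, "Ring", "MC", "Planning"),
    (104, "Glass", "MC", "Ready"), (104, "Glass", "NY", "Ready"),
    (105, "Bolt", "MC", "Expired"),
]

def compare(excel, dw):
    report = []
    for i in sorted({r[0] for r in excel} | {r[0] for r in dw}):
        e_cities = {c for id_, _, c, _ in excel if id_ == i}
        d_cities = {c for id_, _, c, _ in dw if id_ == i}
        e_flags = {f for id_, _, _, f in excel if id_ == i}
        d_flags = {f for id_, _, _, f in dw if id_ == i}
        flag_mismatch = e_flags != d_flags
        city_missing = e_cities != d_cities
        if flag_mismatch or city_missing:
            report.append((i, "Error", "Yes" if flag_mismatch else "No",
                           "Yes" if city_missing else "No"))
        else:
            report.append((i, "No_Error", "", ""))
    return report

for row in compare(excel, dw):
    print(row)
```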

Combining certain columns from multiple files based on a particular column, and not eliminating duplicate names

I had to combine the values of the 2nd column from multiple files, based on the 7th column of all files, so following Ed Morton's answer to a similar question (Combining certain columns of several tab-delimited files based on first column), I wrote this code:
awk 'FNR==1 { ++numFiles }
!seen[$7]++ { keys[++numKeys] = $7 }
{ a[$7,numFiles] = $2 }
END {
    for (keyNr=1; keyNr<=numKeys; keyNr++) {
        key = keys[keyNr]
        printf "%s", key
        for (fileNr=1; fileNr<=numFiles; fileNr++) {
            printf "\t%s", ((key,fileNr) in a ? a[key,fileNr] : "NA")
        }
        print ""
    }
}' file1.txt file2.txt file3.txt > combined.txt
INPUT FILE 1 :
+-------+-----------------+----------+-------------+----------+-------------+-------------+
| ID | adj.P.Val_file1 | P.Value | t | B | logFC | Gene.symbol |
+-------+-----------------+----------+-------------+----------+-------------+-------------+
| 36879 | 1.66E-09 | 7.02E-14 | -12.3836337 | 21.00111 | -2.60060826 | AA |
| 33623 | 1.66E-09 | 7.39E-14 | -12.3599517 | 20.95461 | -2.53106808 | AA |
| 23271 | 2.70E-09 | 2.30E-13 | -11.8478184 | 19.93024 | -2.15050984 | BB |
| 67 | 2.70E-09 | 2.40E-13 | -11.829044 | 19.892 | -3.06680932 | BB |
| 33207 | 1.21E-08 | 1.35E-12 | -11.0793461 | 18.32425 | -2.65246816 | CC |
| 24581 | 1.81E-08 | 2.41E-12 | -10.8325542 | 17.79052 | -1.87937753 | CC |
| 32009 | 3.25E-08 | 5.05E-12 | -10.5240537 | 17.11081 | -1.46505166 | CC |
+-------+-----------------+----------+-------------+----------+-------------+-------------+
INPUT FILE 2 :
+-------+-----------------+----------+------------+-----------+------------+--------------+
| ID | adj.P.Val_file2 | P.Value | t | B | logFC | Gene.symbol |
+-------+-----------------+----------+------------+-----------+------------+--------------+
| 40000 | 5.43E-13 | 1.21E-17 | 17.003819 | 29.155646 | 2.4805744 | FGH |
| 32388 | 1.15E-11 | 5.12E-16 | 14.920047 | 25.829874 | 2.2497567 | FGH |
| 33623 | 6.08E-11 | 4.43E-15 | -13.8115 | 23.870549 | -2.8161587 | ASD |
| 25002 | 6.08E-11 | 5.40E-15 | 13.713018 | 23.689571 | 2.2164681 | ASD |
| 33207 | 2.03E-10 | 2.29E-14 | -13.009752 | 22.36291 | -2.8787392 | ASD |
| 13018 | 2.03E-10 | 2.71E-14 | 12.929201 | 22.207038 | 3.0181585 | ASD |
| 5539 | 2.24E-10 | 3.48E-14 | 12.810902 | 21.976634 | 3.0849706 | ASD |
+-------+-----------------+----------+------------+-----------+------------+--------------+
DESIRED OUTPUT :
+-------------+-----------------+-----------------+
| Gene.symbol | adj.P.Val_file1 | adj.P.Val_file2 |
+-------------+-----------------+-----------------+
| AA | 1.66E-09 | NA |
| AA | 1.66E-09 | NA |
| BB | 2.70E-09 | NA |
| BB | 2.70E-09 | NA |
| CC | 1.21E-08 | NA |
| CC | 1.81E-08 | NA |
| CC | 3.25E-08 | NA |
| FGH | NA | 5.43E-13 |
| FGH | NA | 1.15E-11 |
| ASD | NA | 6.08E-11 |
| ASD | NA | 6.08E-11 |
| ASD | NA | 2.03E-10 |
| ASD | NA | 2.03E-10 |
| ASD | NA | 2.24E-10 |
+-------------+-----------------+-----------------+
The problem is that the 7th column has repeated names, and the code takes only the first occurrence of each name; I want the results for all the repeated names. I tried deleting each line of the code to understand it, but couldn't come up with a solution.
Finally figured out the answer myself!
I just have to remove the !seen[$7]++ condition from my code, as including it means only the first occurrence of any repeated name in the 7th column (nth column in general) is considered.
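For reference, the per-occurrence behaviour the desired output shows (every input row kept, with NA in the other files' columns) can be sketched in Python; the helper below is hypothetical and only mirrors the keys-from-column-7, values-from-column-2 setup:

```python
# Hypothetical helper: files_rows holds, per file, the (column-7 key,
# column-2 value) pairs in input order.  Every occurrence of a key is
# kept as its own output row; the other files' columns get "NA".
def combine(files_rows):
    out = []
    for file_nr, rows in enumerate(files_rows):
        for key, value in rows:
            line = [key] + ["NA"] * len(files_rows)
            line[1 + file_nr] = value
            out.append(line)
    return out

file1 = [("AA", "1.66E-09"), ("AA", "1.66E-09"), ("BB", "2.70E-09")]
file2 = [("FGH", "5.43E-13"), ("ASD", "6.08E-11")]
for row in combine([file1, file2]):
    print("\t".join(row))
```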

SAS SQL: Many to many relationships with 2 tables BUT don't want multiple rows

I have two tables I need to join. They share only one field (ID), and it is not unique. Is it possible to join these two tables but make the result unique and keep all matching data in a row?
For example, I have two tables as follows:
+-------+----------+
| ID | NAME |
+-------+----------+
| A | Jack |
| A | Andy |
| A | Steve |
| A | Jay |
| B | Chris |
| B | Vicky |
| B | Emma |
+-------+----------+
And another table that is ONLY related by the ID column:
+-------+--------+
| ID | Age |
+-------+--------+
| A | 22 |
| A | 31 |
| A | 11 |
| B | 40 |
| B | 17 |
| B | 20 |
| B | 3 |
| B | 65 |
+-------+--------+
The end result I'd like to get is:
+-------+----------+-------+
| ID | NAME | Age |
+-------+----------+-------+
| A | Jack | 22 |
| A | Andy | 31 |
| A | Steve | 11 |
| A | Jay | null |
| B | Chris | 40 |
| B | Vicky | 17 |
| B | Emma | 20 |
| B | null | 3 |
| B | null | 65 |
+-------+----------+-------+
This is the default behavior of the data step merge, except that it won't set the last row's variable to missing - but it's easy to fudge.
There are other ways to do this, the best in my opinion being the hash object if you're comfortable with that.
data names;
    infile datalines dlm='|';
    input ID $ NAME $;
    datalines;
| A | Jack |
| A | Andy |
| A | Steve |
| A | Jay |
| B | Chris |
| B | Vicky |
| B | Emma |
;;;;
run;

data ages;
    infile datalines dlm='|';
    input id $ age;
    datalines;
| A | 22 |
| A | 31 |
| A | 11 |
| B | 40 |
| B | 17 |
| B | 20 |
| B | 3 |
| B | 65 |
;;;;
run;

data want;
    merge names(in=_a) ages(in=_b);
    by id;
    if _a;
    if name ne lag(name) then output; * this assumes name is unique within id - if it is not we may have to do a bit more work here;
    call missing(age); * clear age after output so we do not attempt to fill extra rows with the same age - age will be 'retain'ed;
run;
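The positional pairing that the data step merge performs (k-th name with k-th age within each ID, with the shorter side padded with missing) can be sketched in Python; merge_by_position is a made-up name, and None stands in for a SAS missing value:

```python
from itertools import zip_longest

names = [("A", "Jack"), ("A", "Andy"), ("A", "Steve"), ("A", "Jay"),
         ("B", "Chris"), ("B", "Vicky"), ("B", "Emma")]
ages = [("A", 22), ("A", 31), ("A", 11),
        ("B", 40), ("B", 17), ("B", 20), ("B", 3), ("B", 65)]

def merge_by_position(names, ages):
    # Within each ID, pair the k-th name with the k-th age;
    # zip_longest pads the shorter list with None (missing).
    rows = []
    for i in sorted({x for x, _ in names} | {x for x, _ in ages}):
        n = [v for id_, v in names if id_ == i]
        a = [v for id_, v in ages if id_ == i]
        for name, age in zip_longest(n, a):
            rows.append((i, name, age))
    return rows

for row in merge_by_position(names, ages):
    print(row)  # e.g. ('A', 'Jay', None), ('B', None, 3)
```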

Remove a specific word and everything after in a Cell

I'm having trouble deleting everything after and including the words APT and STE in the Address column.
If you look at the result below, my VBA code is also deleting from words that merely contain the letters STE (2221 STEVENSON LN) or APT.
What is the best way to remove the word APT or STE and everything after it?
Below is my code:
Option Explicit

Sub Remove()
    Dim Sht As Worksheet
    Set Sht = ActiveWorkbook.Sheets("Data")
    With Sht.Range("E:E")
        .Replace "APT*", "", xlPart
        .Replace "STE*", "", xlPart
    End With
End Sub
My Data
+-------+-----+-----+----------+----------------------------------+-------+------+-----+
| Route | Pcs | Wgt | Location | Address | Suite | City | Zip |
+-------+-----+-----+----------+----------------------------------+-------+------+-----+
| SD-26 | 1 | 3 | | 5555 SOUTHWESTERN BLVD | | | |
| SD-26 | 1 | 7 | | 6666 EASTERN AVE APT 100 | | | |
| SD-05 | 1 | 1 | | 161112 HOMESTEAD ST | | | |
| SD-05 | 2 | 8 | | 2221 STEVENSON LN | | | |
| SD-04 | 1 | 8 | | 4040 OLD DENTON RD APT 2104 | | | |
| SD-04 | 1 | 3 | | 15811 E FRANKFORD RD APT 1507 | | | |
| SD-04 | 1 | 1 | | 835 WESTMINSTER DR | | | |
| SD-03 | 1 | 5 | | 9001 LAKESIDE CIR APT 5203 | | | |
| SD-03 | 1 | 3 | | 8880 UNION STATION PKWY APT 2104 | | | |
| SD-03 | 1 | 1 | | 420 E MAIN ST STE E | | | |
+-------+-----+-----+----------+----------------------------------+-------+------+-----+
Result
+-------+-----+-----+----------+--------------------------+-------+------+-----+
| Route | Pcs | Wgt | Location | Address | Suite | City | Zip |
+-------+-----+-----+----------+--------------------------+-------+------+-----+
| SD-26 | 1 | 3 | | 5555 SOUTHWE | | | |
| SD-26 | 1 | 7 | | 6666 EA | | | |
| SD-05 | 1 | 1 | | 161112 HOME | | | |
| SD-05 | 2 | 8 | | 2221 | | | |
| SD-04 | 1 | 8 | | 4040 OLD DENTON RD | | | |
| SD-04 | 1 | 3 | | 15811 E FRANKFORD RD | | | |
| SD-04 | 1 | 1 | | 835 WESTMIN | | | |
| SD-03 | 1 | 5 | | 9001 LAKESIDE CIR | | | |
| SD-03 | 1 | 3 | | 8880 UNION STATION PKWY | | | |
| SD-03 | 1 | 1 | | 420 E MAIN ST | | | |
+-------+-----+-----+----------+--------------------------+-------+------+-----+
Seems to me you can just include a space before and after APT/STE but before the wildcard character.
Sub RemoveAptSte()
    Dim Sht As Worksheet
    Set Sht = ActiveWorkbook.Sheets("Data")
    With Sht.Range("E:E")
        .Replace " APT *", vbNullString, xlPart
        .Replace " STE *", vbNullString, xlPart
    End With
End Sub
That should remove just about any false positive from consideration.
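Outside VBA, the same word-boundary idea is easy to verify with a regular expression; this Python sketch treats APT and STE as whole words only, so STEVENSON and EASTERN are left alone:

```python
import re

# \b makes APT/STE match only as whole words; \s* also trims the
# space left behind, and .*$ removes everything after the match.
pattern = re.compile(r"\s*\b(?:APT|STE)\b.*$")

def strip_suite(address):
    return pattern.sub("", address)

print(strip_suite("6666 EASTERN AVE APT 100"))  # 6666 EASTERN AVE
print(strip_suite("2221 STEVENSON LN"))         # 2221 STEVENSON LN
print(strip_suite("420 E MAIN ST STE E"))       # 420 E MAIN ST
```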