Combining certain columns from multiple files based on a particular column, without eliminating duplicate names - awk

I had to combine the values of the 2nd columns from multiple files, based on the 7th column of each file, so, building on Ed Morton's answer to a similar question (Combining certain columns of several tab-delimited files based on first column), I wrote this code:
awk '
FNR==1      { ++numFiles }                  # a new input file has started
!seen[$7]++ { keys[++numKeys] = $7 }        # record each key the first time only
            { a[$7,numFiles] = $2 }         # store column 2 per (key, file)
END {
    for (keyNr=1; keyNr<=numKeys; keyNr++) {
        key = keys[keyNr]
        printf "%s", key
        for (fileNr=1; fileNr<=numFiles; fileNr++) {
            printf "\t%s", ((key,fileNr) in a ? a[key,fileNr] : "NA")
        }
        print ""
    }
}' file1.txt file2.txt file3.txt > combined.txt
INPUT FILE 1:
+-------+-----------------+----------+-------------+----------+-------------+-------------+
| ID | adj.P.Val_file1 | P.Value | t | B | logFC | Gene.symbol |
+-------+-----------------+----------+-------------+----------+-------------+-------------+
| 36879 | 1.66E-09 | 7.02E-14 | -12.3836337 | 21.00111 | -2.60060826 | AA |
| 33623 | 1.66E-09 | 7.39E-14 | -12.3599517 | 20.95461 | -2.53106808 | AA |
| 23271 | 2.70E-09 | 2.30E-13 | -11.8478184 | 19.93024 | -2.15050984 | BB |
| 67 | 2.70E-09 | 2.40E-13 | -11.829044 | 19.892 | -3.06680932 | BB |
| 33207 | 1.21E-08 | 1.35E-12 | -11.0793461 | 18.32425 | -2.65246816 | CC |
| 24581 | 1.81E-08 | 2.41E-12 | -10.8325542 | 17.79052 | -1.87937753 | CC |
| 32009 | 3.25E-08 | 5.05E-12 | -10.5240537 | 17.11081 | -1.46505166 | CC |
+-------+-----------------+----------+-------------+----------+-------------+-------------+
INPUT FILE 2:
+-------+-----------------+----------+------------+-----------+------------+--------------+
| ID | adj.P.Val_file2 | P.Value | t | B | logFC | Gene.symbol |
+-------+-----------------+----------+------------+-----------+------------+--------------+
| 40000 | 5.43E-13 | 1.21E-17 | 17.003819 | 29.155646 | 2.4805744 | FGH |
| 32388 | 1.15E-11 | 5.12E-16 | 14.920047 | 25.829874 | 2.2497567 | FGH |
| 33623 | 6.08E-11 | 4.43E-15 | -13.8115 | 23.870549 | -2.8161587 | ASD |
| 25002 | 6.08E-11 | 5.40E-15 | 13.713018 | 23.689571 | 2.2164681 | ASD |
| 33207 | 2.03E-10 | 2.29E-14 | -13.009752 | 22.36291 | -2.8787392 | ASD |
| 13018 | 2.03E-10 | 2.71E-14 | 12.929201 | 22.207038 | 3.0181585 | ASD |
| 5539 | 2.24E-10 | 3.48E-14 | 12.810902 | 21.976634 | 3.0849706 | ASD |
+-------+-----------------+----------+------------+-----------+------------+--------------+
DESIRED OUTPUT:
+-------------+-----------------+-----------------+
| Gene.symbol | adj.P.Val_file1 | adj.P.Val_file2 |
+-------------+-----------------+-----------------+
| AA | 1.66E-09 | NA |
| AA | 1.66E-09 | NA |
| BB | 2.70E-09 | NA |
| BB | 2.70E-09 | NA |
| CC | 1.21E-08 | NA |
| CC | 1.81E-08 | NA |
| CC | 3.25E-08 | NA |
| FGH | NA | 5.43E-13 |
| FGH | NA | 1.15E-11 |
| ASD | NA | 6.08E-11 |
| ASD | NA | 6.08E-11 |
| ASD | NA | 2.03E-10 |
| ASD | NA | 2.03E-10 |
| ASD | NA | 2.24E-10 |
+-------------+-----------------+-----------------+
The problem is that the 7th column has repeated names, and the code only keeps the first occurrence of each name; I want the results for all the repeated names. I tried deleting each line of the code in turn to understand it, but couldn't come up with a solution.

Finally figured out the answer myself!
I just have to remove the !seen[$7]++ condition from my code (keeping the { keys[++numKeys] = $7 } action), since that condition makes the script record only the first occurrence of any repeated name in the 7th column (nth column in general).
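With the condition removed, every input line appends its key to keys[], so repeated names each get their own output row. One caveat worth spelling out: a[$7,numFiles] still keeps only one value per (key, file) pair, so a name that repeats within one file with different values (like CC above) would print that file's last stored value on every one of its rows. A minimal sketch that instead keeps each occurrence's own value (my rework, under the assumption that each output row should carry the value from exactly one input line, as in the desired output above):
awk '
FNR==1 { ++numFiles }
{
    keys[++numRows] = $7       # one entry per input line, duplicates kept
    val[numRows]    = $2       # that line's column-2 value
    src[numRows]    = numFiles # which file the line came from
}
END {
    for (rowNr=1; rowNr<=numRows; rowNr++) {
        printf "%s", keys[rowNr]
        for (fileNr=1; fileNr<=numFiles; fileNr++)
            printf "\t%s", (src[rowNr]==fileNr ? val[rowNr] : "NA")
        print ""
    }
}' file1.txt file2.txt file3.txt > combined.txt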

Related

Spark DataFrame: Ignore columns with empty IDs in groupBy

I have a dataframe e.g. with this structure:
ID | Date | P1_ID | P2_ID | P3_ID | P1_A | P1_B | P2_A | ...
============================================================
1  | 123  | 1     |       |       | A1   | B1   |      | ...  <- only P1_x columns filled
1  | 123  | 2     |       |       | A2   | B2   |      | ...  <- only P1_x columns filled
1  | 123  | 3     |       |       | A3   | B3   |      | ...  <- only P1_x columns filled
1  | 123  |       | 1     |       |      |      | A4   | ...  <- only P2_x columns filled
1  | 123  |       | 2     |       |      |      | A5   | ...  <- only P2_x columns filled
1  | 123  |       |       | 1     |      |      |      | ...  <- only P3_x columns filled
I need to combine the rows that have the same ID, Date and Px_ID values, ignoring empty values in the Px_ID columns when comparing the key columns.
In the end I need a dataframe like this:
ID | Date | P1_ID | P2_ID | P3_ID | P1_A | P1_B | P2_A | ...
============================================================
1  | 123  | 1     | 1     | 1     | A1   | B1   | A4   | ...
1  | 123  | 2     | 2     |       | A2   | B2   | A5   | ...
1  | 123  | 3     |       |       | A3   | B3   |      | ...
Is this possible and how? Thank you!
I found a solution for this problem: since the non-relevant x_ID columns are empty, one possible way is to create a new column combined_ID that contains a concatenation of all x_ID column values (this will only contain one value, since only one x_ID is non-empty in each row):
import org.apache.spark.sql.functions._  // for col() and concat()

val xIdArray = Seq("P1_ID", "P2_ID", "P3_ID").map(col)  // build a Seq[Column] from the names
myDF = myDF.withColumn("combined_ID", concat(xIdArray : _*))
This changes the DF to the following structure:
ID | Date | P1_ID | P2_ID | P3_ID | P1_A | P1_B | P2_A | ... | combined_ID
===========================================================================
1  | 123  | 1     |       |       | A1   | B1   |      | ... | 1
1  | 123  | 2     |       |       | A2   | B2   |      | ... | 2
1  | 123  | 3     |       |       | A3   | B3   |      | ... | 3
1  | 123  |       | 1     |       |      |      | A4   | ... | 1
1  | 123  |       | 2     |       |      |      | A5   | ... | 2
1  | 123  |       |       | 1     |      |      |      | ... | 1
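One caveat (my note, not from the original answer): Spark's concat returns NULL as soon as any input column is NULL, so the trick above relies on the empty Px_ID cells being empty strings. If they can be NULL, concat_ws, which skips NULL inputs, is the safer call:
// Hedged alternative for the case where the empty Px_ID cells are NULL rather than "":
// concat_ws joins only the non-NULL values with the given separator (here, none).
myDF = myDF.withColumn("combined_ID", concat_ws("", xIdArray : _*))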
Now I can simply group my DF by ID, Date and combined_ID, and aggregate all the relevant columns with e.g. the max function to pick up the values of the non-empty cells:
val groupByColumns = Seq("ID", "Date", "combined_ID")  // group on the new combined key
// max(...).as(...) keeps the original column names on the aggregated output
val aggExprs = Seq("P1_ID", "P2_ID", "P3_ID", "P1_A", "P1_B", "P2_A", ...).map(c => max(c).as(c))
myDF = myDF.groupBy(groupByColumns.head, groupByColumns.tail : _*).agg(aggExprs.head, aggExprs.tail : _*)
Result:
ID | Date | combined_ID | P1_ID | P2_ID | P3_ID | P1_A | P1_B | P2_A | ...
===========================================================================
1  | 123  | 1           | 1     | 1     | 1     | A1   | B1   | A4   | ...
1  | 123  | 2           | 2     | 2     |       | A2   | B2   | A5   | ...
1  | 123  | 3           | 3     |       |       | A3   | B3   |      | ...

awk and csv columns

I have a csv file that looks like this:
Raw CSV (;-separated) to be parsed:
number;abc;gender;middel;name;last name;existfrom;existuntil;ahv;specialnum;UID;pert;address;street;numberstr;po box;place;cntr;cntrcd;from;until;date;extra
1;100000388;lady;W;jane;doe;16.08.1930;31.12.9999;7.56777E+12;500000387;;N;WOA;angel;47;;6020;moon;NL;16.01.2016;31.12.9999;31.12.9999;
2;100000453;mister;M;jon;doe;29.12.1934;31.12.9999;7.56663E+12;500000452;;N;WOA;angel;11;;6314;moon;GR;16.01.2016;31.12.9999;22.09.2008;mutation
3;100000469;lady;W;jane;doe;16.02.1941;31.12.9999;7.56486E+12;500000468;;N;WOA;angel;11;;6314;mooon;DE;16.01.2016;31.12.9999;22.09.2008;mutation
4;100000490;mister;M;jon;doe;09.05.1936;31.12.9999;7.56841E+12;500000497;;N;WOA;silicon;65;;6340;moon;CZ;16.01.2016;31.12.9999;15.12.2010;Mutation
Formatted table for ease of reading:
+--------+-----------+--------+--------+------+-----------+------------+------------+-------------+------------+-----+------+---------+---------+-----------+--------+-------+-------+--------+------------+------------+------------+-----------+
| number | abc | gender | middel | name | last name | existfrom | existuntil | ahv | specialnum | UID | pert | address | street | numberstr | po box | place | cntr | cntrcd | from | until | date | extra |
+--------+-----------+--------+--------+------+-----------+------------+------------+-------------+------------+-----+------+---------+---------+-----------+--------+-------+-------+--------+------------+------------+------------+-----------+
| 1 | 100000388 | lady | W | jane | doe | 16.08.1930 | 31.12.9999 | 7.56777E+12 | 500000387 | | N | WOA | angel | 47 | | 6020 | moon | NL | 16.01.2016 | 31.12.9999 | 31.12.9999 | |
| 2 | 100000453 | mister | M | jon | doe | 29.12.1934 | 31.12.9999 | 7.56663E+12 | 500000452 | | N | WOA | angel | 11 | | 6314 | moon | GR | 16.01.2016 | 31.12.9999 | 22.09.2008 | mutation |
| 3 | 100000469 | lady | W | jane | doe | 16.02.1941 | 31.12.9999 | 7.56486E+12 | 500000468 | | N | WOA | angel | 11 | | 6314 | mooon | DE | 16.01.2016 | 31.12.9999 | 22.09.2008 | mutation |
| 4 | 100000490 | mister | M | jon | doe | 09.05.1936 | 31.12.9999 | 7.56841E+12 | 500000497 | | N | WOA | silicon | 65 | | 6340 | moon | CZ | 16.01.2016 | 31.12.9999 | 15.12.2010 | Mutation |
+--------+-----------+--------+--------+------+-----------+------------+------------+-------------+------------+-----+------+---------+---------+-----------+--------+-------+-------+--------+------------+------------+------------+-----------+
Among other tasks, I need to take the last 3 digits from column I and place them into column O, but so far with no success.
For example: column I holds the number 7567766885966; I need to take the 966 and put it into column O, replacing the number currently there.
Until now I have managed to do this:
#!/bin/sh
oldCsvFile=old.csv
newCsvFile=new.csv
# insert the first line as-is into the new file
head -1 "$oldCsvFile" > "$newCsvFile"
# replace columns 7 and 13 with overwriting values
awk 'BEGIN { FS = OFS = ";" } { $7="\"01.08.2019\""; $13="\"RandomStreet\""; print }' "$oldCsvFile" >> "$newCsvFile"
# delete the second and last lines, which are the first and last lines of the original file with overwritten columns
sed -i '2 d' "$newCsvFile"
sed -i '$ d' "$newCsvFile"
# insert the unmodified last line into the new file
tail -1 "$oldCsvFile" >> "$newCsvFile"
Can you please help me out?
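For the last-3-digits task specifically, here is a minimal sketch (my suggestion, not from the original post). It assumes column I is the 9th ;-separated field and column O the 15th, and that the file actually stores the full 13-digit number; the 7.56777E+12 form shown above looks like Excel's display rounding, whose last 3 characters would be "+12" instead:
# Copy the last 3 characters of field 9 into field 15, leaving the header untouched.
awk 'BEGIN { FS = OFS = ";" }
NR == 1 { print; next }                              # pass the header row through unchanged
        { $15 = substr($9, length($9) - 2); print }' old.csv > new.csv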

VBA Copy & Paste Loop ( Generate Field Number)

Right now I'm working on generating labels based on a quantity in Excel. I managed to get it to copy & paste based on a value from a cell, but I don't know how to make certain cells change according to the loop.
Below is an example.
Current result:
| A | B | C | D | E |
|------------------------------- |----- |-------------------- |----- |----- |
| NMB IN DIA | | MADE IN THAILAND | | |
| INVOICE NO | : | MM035639 | | |
| C/NO | : | 1 | / | 2 |
| SHIP TO | : | A | | |
| QTY | : | 100 | | |
| NMB PARTS NO | : | SFASDF234 | | |
| | | *SFASDF234* | | |
| CUST PARTS NO | : | SFASDF234 | | |
| CUST ORDER NO | : | | | |
| ----------------------------- | --- | ------------------ | --- | --- |
| NMB IN DIA | | MADE IN THAILAND | | |
| INVOICE NO | : | MM035639 | | |
| C/NO | : | 1 | / | 2 |
| SHIP TO | : | A | | |
| QTY | : | 100 | | |
| NMB PARTS NO | : | SFASDF234 | | |
| | | *SFASDF234* | | |
| CUST PARTS NO | : | | | |
| CUST ORDES NO | : | | | |
Expected result:
| A | B | C | D | E |
|------------------------------- |----- |-------------------- |----- |----- |
| NMB IN DIA | | MADE IN THAILAND | | |
| INVOICE NO | : | MM035639 | | |
| C/NO | : | 1 | / | 2 |
| SHIP TO | : | A | | |
| QTY | : | 100 | | |
| NMB PARTS NO | : | SFASDF234 | | |
| | | *SFASDF234* | | |
| CUST PARTS NO | : | SFASDF234 | | |
| CUST ORDER NO | : | | | |
| ----------------------------- | --- | ------------------ | --- | --- |
| NMB IN DIA | | MADE IN THAILAND | | |
| INVOICE NO | : | MM035639 | | |
| C/NO | : | 2 | / | 2 |
| SHIP TO | : | A | | |
| QTY | : | 100 | | |
| NMB PARTS NO | : | SFASDF234 | | |
| | | *SFASDF234* | | |
| CUST PARTS NO | : | | | |
| CUST ORDES NO | : | | | |
As you can see in the expected result, the C/NO value increments with the loop based on the quantity, rather than being a plain copy-paste. Is there anything I can add?
Below is my current code:
Private Sub CommandButton1_Click()
    Dim i As Long
    For i = 2 To Worksheets("Sheet3").Range("E3").Value
        Range("A1:A9", Range("E9")).Copy Sheet3.Range("A65536").End(xlUp)(2)
    Next i
End Sub
Just set the value of the relevant cell to i:
Private Sub CommandButton1_Click()
    Dim i As Long
    Dim NewLoc As Range
    For i = 2 To Worksheets("Sheet3").Range("E3").Value
        'Decide where to copy the output to
        Set NewLoc = Sheet3.Cells(Sheet3.Rows.Count, "A").End(xlUp).Offset(1, 0)
        'Copy the range
        Range("A1:E9").Copy NewLoc
        'Change the value of the cell 2 rows down and 2 columns to the right (the C/NO number)
        NewLoc.Offset(2, 2).Value = i
    Next i
End Sub
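As a side note (my addition, not part of the original answer): if only the values matter and the label formatting does not need to carry over, the clipboard can be skipped by assigning .Value directly between same-sized ranges:
        'Clipboard-free variant: copies values only, so cell formatting is NOT carried over.
        NewLoc.Resize(9, 5).Value = Sheet3.Range("A1:E9").Value
        NewLoc.Offset(2, 2).Value = i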

Crosstab query to show the working hours per day for each Vessel

I have made a crosstab query that should give the total working hours on each day for every vessel we had in our small harbor.
My query:
TRANSFORM Sum(Main.WorkingH) AS SumOfWorkingH
SELECT DateValue([DeptDate]) AS [Date]
FROM Vessels INNER JOIN Main ON Vessels.ID = Main.VesselID
GROUP BY DateValue([DeptDate])
ORDER BY DateValue([DeptDate])
PIVOT Vessels.Vessel;
The problem is that this query assigns each trip's total working hours entirely to the departure date:
+-----------+----+----+----+----+----+----+----+
| Date      | A1 | A2 | A3 | F3 | F4 | F5 | F6 |
+-----------+----+----+----+----+----+----+----+
| 26-May-17 |    |    | 32 | 29 |    |    |    |
| 27-May-17 | 3  | 13 |    |    |    |    |    |
| 28-May-17 |    |    |    |    |    |    | 73 |
| 29-May-17 |    |    |    | 12 | 6  | 27 |    |
| 01-Jun-17 |    |    | 10 |    | 7  | 41 |    |
| 02-Jun-17 |    | 2  | 15 | 5  |    |    |    |
| 03-Jun-17 |    | 4  |    |    |    |    |    |
+-----------+----+----+----+----+----+----+----+
The Desired Result
When a vessel leaves at 9 PM on 6/1 and arrives back at 10 AM on 6/3, it should appear as follows:
6/1 --> 3 hours
6/2 --> 24 hours
6/3 --> 10 hours
**NOT** 6/1 --> 37 hours, as in the previous table.
This is how it should look:
+-----------+----+----+----+----+----+----+----+
| Date      | A1 | A2 | A3 | F3 | F4 | F5 | F6 |
+-----------+----+----+----+----+----+----+----+
| 26-May-17 |    |    | 5  | 7  |    |    |    |
| 27-May-17 | 3  | 13 | 24 | 21 |    |    |    |
| 28-May-17 |    |    | 2  |    |    |    | 9  |
| 29-May-17 |    |    |    | 12 | 6  | 8  | 24 |
| 30-May-17 |    |    |    |    |    | 18 | 24 |
| 31-May-17 |    |    |    |    |    |    | 15 |
| 01-Jun-17 |    |    | 10 |    | 7  | 0  |    |
| 02-Jun-17 |    | 2  | 15 | 5  | 24 |    |    |
| 03-Jun-17 |    | 4  |    |    |    | 16 |    |
+-----------+----+----+----+----+----+----+----+
These values are not accurate (I wrote them by hand), but I think you get the idea.
The Suggested Solution
While trying to fix this problem, I wrote the following function, which takes the start and end times of a trip and loops over each day of the trip:
Public Function HoursByDate1(stTime, EndTime)
    'Walk backwards over every calendar day of the trip
    For dayloop = Int(EndTime) To Int(stTime) Step -1
        If dayloop = Int(stTime) Then
            WorkingHours = Hour(dayloop + 1 - stTime)   'departure day: hours until midnight
        ElseIf dayloop = Int(EndTime) Then
            WorkingHours = Hour(EndTime - dayloop)      'arrival day: hours since midnight
        Else
            WorkingHours = 24                           'full day in between
        End If
        HoursByDate1 = WorkingHours                     'overwritten on every pass!
        Debug.Print "StartDate: " & stTime & ", EndDate:" & EndTime & ", The day:" & dayloop & " --> " & WorkingHours & " hours."
    Next dayloop
End Function
It prints the hours per day in the Immediate window, which is exactly what I want.
But when I call this function from my query, it gets only the last value for each trip, as follows:
+-----------+----+----+----+----+----+----+----+
| Date      | A1 | A2 | A3 | F3 | F4 | F5 | F6 |
+-----------+----+----+----+----+----+----+----+
| 5/26/2017 |    |    | 5  | 7  |    |    |    |
| 5/27/2017 | 15 | 19 |    |    |    |    |    |
| 5/28/2017 |    |    |    |    |    |    | 9  |
| 5/29/2017 |    |    |    | 8  | 7  | 8  |    |
| 6/1/2017  |    |    | 3  |    | 6  | 0  |    |
| 6/2/2017  |    | 8  | 8  | 19 |    |    |    |
| 6/3/2017  |    | 9  |    |    |    |    |    |
+-----------+----+----+----+----+----+----+----+
I am open to any solution, on the VBA side or the SQL query side.
Sorry for the very long question, but I wanted to show my effort on the subject, because every time I ask I am told there is not enough information.
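For what it's worth, a sketch of one possible fix (my suggestion, not from the original post): the function above overwrites its return value on every loop pass, so only the last day's hours ever reach the query. One way around this is a function that returns the hours for a single given day, called once per (trip, day) pair from a query joined against a calendar table of dates. The column and table names below are assumptions:
Public Function HoursOnDay(stTime As Date, EndTime As Date, theDay As Date) As Double
    'Hours the trip stTime..EndTime contributes to the single calendar day theDay.
    If theDay < Int(stTime) Or theDay > Int(EndTime) Then
        HoursOnDay = 0                                   'day lies outside the trip
    ElseIf Int(stTime) = Int(EndTime) Then
        HoursOnDay = (EndTime - stTime) * 24             'trip starts and ends on the same day
    ElseIf theDay = Int(stTime) Then
        HoursOnDay = (Int(stTime) + 1 - stTime) * 24     'departure day: hours until midnight
    ElseIf theDay = Int(EndTime) Then
        HoursOnDay = (EndTime - Int(EndTime)) * 24       'arrival day: hours since midnight
    Else
        HoursOnDay = 24                                  'full day in between
    End If
End Function
The crosstab would then group by the calendar date instead of DateValue([DeptDate]), e.g. joining on Calendar.D BETWEEN DateValue([DeptDate]) AND DateValue([ArrDate]) (assuming an ArrDate arrival column exists in Main) and summing HoursOnDay([DeptDate], [ArrDate], [Calendar].[D]).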

Need to shift data to the next column; unfortunately added data to the wrong column

I have a table test:
+----+---------+-------+
| ID | Name1   | Name2 |
+----+---------+-------+
| 1  | Andy    | NULL  |
| 2  | Kevin   | NULL  |
| 3  | Phil    | NULL  |
| 4  | Maria   | NULL  |
| 5  | Jackson | NULL  |
+----+---------+-------+
I am expecting output like this:
+----+-------+---------+
| ID | Name1 | Name2   |
+----+-------+---------+
| 1  | NULL  | Andy    |
| 2  | NULL  | Kevin   |
| 3  | NULL  | Phil    |
| 4  | NULL  | Maria   |
| 5  | NULL  | Jackson |
+----+-------+---------+
I unfortunately inserted the data into the wrong column, and now I want to shift it to the next column.
You can use an UPDATE statement with no WHERE clause, so that it covers the entire table:
UPDATE test
SET Name2 = Name1,
    Name1 = NULL;
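If some rows might already be correct, a variant (my addition) restricts the update to rows where Name1 is actually filled. The assignment order shown also keeps this safe under MySQL's left-to-right SET evaluation, since Name2 takes Name1's value before Name1 is cleared; in standard SQL (and e.g. SQL Server) every right-hand side sees the pre-update row, so the order would not matter:
-- Only touch rows where the misplaced value exists
UPDATE test
SET Name2 = Name1,
    Name1 = NULL
WHERE Name1 IS NOT NULL;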