Remove repeated occurrences in column 2 - awk

I have the following input file with 2 fields separated by a tab. I hope to explain it well enough.
Description
Field 2 contains chapters. In this case there are two: HISTORY OF THE COUNTRY and PHYSICAL GEOGRAPHY.
All chapters are marked with the value 10 in field 1.
The sections beneath each chapter are marked with the value 07.
The content is marked with the value 05.
The next chapter begins when $1 == 10 and $2 differs from the previous chapter.
In this case Chapter 1 goes from line 1 to line 16
In this case Chapter 2 goes from line 17 to end of file.
The chapters, sections and content may appear multiple times in field 2.
For example:
HISTORY OF THE COUNTRY appears 4 times between line 1 and line 16
PHYSICAL GEOGRAPHY appears 2 times between line 17 and end of file
My goal is:
Remove repeated occurrences of chapters and sections, leaving everything in its original order of appearance. For the content, don't remove anything. That is:
for chapter 1, remove repeated HISTORY OF THE COUNTRY within chapter 1's context (between lines 1 and 16)
for chapter 2, remove repeated PHYSICAL GEOGRAPHY within chapter 2's context (between lines 17 and 25)
The Input is this:
10 HISTORY OF THE COUNTRY
07 FIRST PART
07 INTRODUCTION
05 Article 1
10 HISTORY OF THE COUNTRY
07 FIRST PART
07 INTRODUCTION
05 Article 2
10 HISTORY OF THE COUNTRY
07 SECOND PART
07 REVIEW
05 Article 1
10 HISTORY OF THE COUNTRY
07 SECOND PART
07 METHODOLOGY
05 Article1
10 PHYSICAL GEOGRAPHY
07 FIRST PART
07 INTRODUCTION
05 First section
10 PHYSICAL GEOGRAPHY
07 FIRST PART
07 INTRODUCTION
05 Second Section
and output would be like this:
10 HISTORY OF THE COUNTRY
07 FIRST PART
07 INTRODUCTION
05 Article 1
05 Article 2
07 SECOND PART
07 REVIEW
05 Article 1
07 METHODOLOGY
05 Article1
10 PHYSICAL GEOGRAPHY
07 FIRST PART
07 INTRODUCTION
05 First section
05 Second Section
My current code prints something close, but not what I'm looking for.
awk '$2 in a {next} {a[$2]++} 1' input.txt
10 HISTORY OF THE COUNTRY
07 FIRST PART
07 INTRODUCTION
05 Article 1
07 SECOND PART
07 REVIEW
07 METHODOLOGY
05 Article1
10 PHYSICAL GEOGRAPHY
05 First section
05 Second Section
(A color-coded screenshot highlighting chapters and sections in the input, alongside the current and desired output, accompanied the original question.) Thanks for any help.

You may use this awk:
awk -F '\t' '$1 == 10 { ch = $2; sec = "" } $1+0 == 7 { sec = $2 }
($1+0 == 5 && !seen[ch,sec,$0]++) || !seen[ch,$0]++' file
10 HISTORY OF THE COUNTRY
07 FIRST PART
07 INTRODUCTION
05 Article 1
05 Article 2
07 SECOND PART
07 REVIEW
05 Article 1
07 METHODOLOGY
05 Article1
10 PHYSICAL GEOGRAPHY
07 FIRST PART
07 INTRODUCTION
05 First section
05 Second Section

another awk
You need to apply uniqueness in a path-dependent way: print all unique chapters, unique parts within a chapter, and unique articles/sections within a part.
$ awk '$1=="10" && !h1[c=$0]++;
$1=="07" && !h2[c,s=$0]++;
$1=="05" && !h3[c,s,$0]++' file
10 HISTORY OF THE COUNTRY
07 FIRST PART
07 INTRODUCTION
05 Article 1
05 Article 2
07 SECOND PART
07 REVIEW
05 Article 1
07 METHODOLOGY
05 Article1
10 PHYSICAL GEOGRAPHY
07 FIRST PART
07 INTRODUCTION
05 First section
05 Second Section
{print $0} is implied when the action is missing.
In awk, the associative array a[k] (similar to a hash map) is initially empty for any key k, so a[k] evaluates to 0/null (false as a boolean). The post-incremented, negated expression !a[k]++ is therefore true only the first time a given key k is seen, which makes it a uniqueness filter. Here h1, h2, and h3 track unique headers at each level; because each key includes the parent path, only values unique within their parent are printed.
This script can be shortened for this problem, but as written it is easy to extend if you need to add another level.
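As a minimal, self-contained illustration of the !a[k]++ idiom on its own (input invented for the demo): each line is printed only on its first occurrence.

```shell
printf 'x\ny\nx\nz\ny\n' | awk '!seen[$0]++'
# prints:
# x
# y
# z
```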

Related

Fill in gap with prior record value having a populated quantity LIMIT: no analytics can be used

Assume data with structure like this: Demo
WITH CAL AS(
SELECT 2022 YR, '01' PERIOD UNION ALL
SELECT 2022 YR, '02' PERIOD UNION ALL
SELECT 2022 YR, '03' PERIOD UNION ALL
SELECT 2022 YR, '04' PERIOD UNION ALL
SELECT 2022 YR, '05' PERIOD UNION ALL
SELECT 2022 YR, '06' PERIOD UNION ALL
SELECT 2022 YR, '07' PERIOD UNION ALL
SELECT 2022 YR, '08' PERIOD UNION ALL
SELECT 2022 YR, '09' PERIOD UNION ALL
SELECT 2022 YR, '10' PERIOD UNION ALL
SELECT 2022 YR, '11' PERIOD UNION ALL
SELECT 2022 YR, '12' PERIOD ),
Data AS (
SELECT 2022 YR, '01' PERIOD, 10 qty UNION ALL
SELECT 2022 YR, '02' PERIOD, 5 qty UNION ALL
SELECT 2022 YR, '04' PERIOD, 10 qty UNION ALL
SELECT 2022 YR, '05' PERIOD, 7 qty UNION ALL
SELECT 2022 YR, '09' PERIOD, 1 qty)
SELECT *
FROM CAL A
LEFT JOIN data B
on A.YR = B.YR
and A.Period = B.Period
WHERE A.Period <10 and A.YR = 2022
ORDER by A.period
Giving us:
+------+--------+------+--------+-----+
| YR | PERIOD | YR | PERIOD | qty |
+------+--------+------+--------+-----+
| 2022 | 01 | 2022 | 01 | 10 |
| 2022 | 02 | 2022 | 02 | 5 |
| 2022 | 03 | | | |
| 2022 | 04 | 2022 | 04 | 10 |
| 2022 | 05 | 2022 | 05 | 7 |
| 2022 | 06 | | | |
| 2022 | 07 | | | |
| 2022 | 08 | | | |
| 2022 | 09 | 2022 | 09 | 1 |
+------+--------+------+--------+-----+
With Expected result of:
+------+--------+------+--------+-----+
| YR | PERIOD | YR | PERIOD | qty |
+------+--------+------+--------+-----+
| 2022 | 01 | 2022 | 01 | 10 |
| 2022 | 02 | 2022 | 02 | 5 |
| 2022 | 03 | 2022 | 03 | 5 | -- SQL derives
| 2022 | 04 | 2022 | 04 | 10 |
| 2022 | 05 | 2022 | 05 | 7 |
| 2022 | 06 | 2022 | 06 | 7 | -- SQL derives
| 2022 | 07 | 2022 | 07 | 7 | -- SQL derives
| 2022 | 08 | 2022 | 08 | 7 | -- SQL derives
| 2022 | 09 | 2022 | 09 | 1 |
+------+--------+------+--------+-----+
QUESTION:
How would one go about filling in the gaps in periods 03, 06, 07 and 08 with a record quantity referencing the nearest earlier period/year? Note the example is limited to one year, but the gap could be in period 01 of 2022, in which case we would need to return the 2021 period-12 quantity if populated, or keep going back until a quantity is found or no such record exists.
LIMITS:
I am unable to use table value functions. (No lateral, no Cross Apply)
I'm unable to use analytics (no lead/lag)
Correlated subqueries are iffy.
Why the limits? This must be done in a HANA graphical calculation view, which supports none of those concepts. I've not done enough with correlated subqueries at this point to know whether one is possible here.
I can create any number of inline views or materialized datasets needed.
STATISTICS:
this table has over a million rows and grows at a rate of products × locations × periods × years; 1000 products × 20 locations × 12 periods × 6 years = 1.44 million+ rows.
Each product's inventory may be recorded at the end of a month for a given location. No activity for a product/location means no record, hence a gap (a silly mainframe storage-saving technique used in an RDBMS... I mean, how do I know the system didn't simply fail to insert the record for that material, or omit it for some reason?).
In the cases where it is not recorded, we need to fill in the gap. The example provided is stripped to the bare bones, without location and material, as I do not believe they are salient to a solution.
ISSUE:
I'll need to convert the SQL to a "HANA Graphical calculation view"
Yes, I know I could create a SQL Script to do this. This is not allowed.
Yes, I know I could create a table function to do this. This is not allowed.
This must be accomplished though Graphical calculation view which supports basic SQL functions
Basic joins (INNER, OUTER, FULL OUTER, CROSS), filters, aggregation, and a basic rank (at a significant performance impact if all records are evaluated), plus a few other things; but no window functions, no CROSS APPLY, no LATERAL.
As to why: it's about maintenance and staffing. The staffed area is a reporting team that creates views used in universes. The team wishes to keep all scripts out of use to keep staffing costs lower, as SQL knowledge wouldn't be required for future staff positions, though it helps!
For those familiar this issue is sourced from MBEWH table in an ECC implementation
This can be done with graphical calculation views in SAP HANA.
It's not pretty and probably not very efficient, though.
Whether the people who are supposedly able to maintain graphical calc. views but not SQL statements will be able to maintain this successfully is rather questionable.
First, the approach in SQL, so that the approach becomes clear:
create column table calendar
( yr integer
, period nvarchar (2)
, primary key (yr, period))
insert into calendar
( select year (generated_period_start) as yr
, ABAP_NUMC( month(generated_period_start), 2) as period
from series_generate_date ('INTERVAL 1 MONTH', '2022-01-01', '2023-01-01'));
create column table data
( yr integer
, period nvarchar (2)
, qty integer
, primary key (yr, period));
insert into data values (2022, '01', 10);
insert into data values (2022, '02', 5);
insert into data values (2022, '04', 10);
insert into data values (2022, '05', 7);
insert into data values (2022, '09', 1);
SELECT *
FROM CALendar A
LEFT JOIN data B
on A.YR = B.YR
and A.Period = B.Period
WHERE A.Period <'10' and A.YR =2022
ORDER BY A.period;
/*
YR PERIOD YR PERIOD QTY
2,022 01 2,022 01 10
2,022 02 2,022 02 5
2,022 03 ? ? ?
2,022 04 2,022 04 10
2,022 05 2,022 05 7
2,022 06 ? ? ?
2,022 07 ? ? ?
2,022 08 ? ? ?
2,022 09 2,022 09 1
*/
The ABAP_NUMC() function creates ABAP NUMC strings (with leading zeroes) from integers. Other than that, these are pretty much the tables from the OP.
The general approach is to use the CALENDAR table as the main driving table that establishes for which dates/periods there will be output rows.
This is outer joined with the DATA table, leaving "missing" rows with NULL in the corresponding columns.
Next, the DATA table is joined again, this time with YEAR||PERIOD combinations that are strictly smaller than the YEAR||PERIOD from the CALENDAR table. This gives us rows for all previous records in DATA.
Next, we need to pick which of the previous rows we want to look at.
This is done via the ROWNUM() function and a filter to the first record.
As graphical calculation views don't support ROWNUM() this can be exchanged with RANK() - this works as long as there are no two actual DATA records for the same YEAR||PERIOD combination.
Finally, in the projection we use COALESCE to switch between the actual information available in DATA and - if that is NULL - the previous period information.
/*
CAL_YR CAL_PER COALESCE(DAT_YR,PREV_YR) COALESCE(DAT_PER,PREV_PER) COALESCE(DAT_QTY,PREV_QTY)
2,022 01 2,022 01 10
2,022 02 2,022 02 5
2,022 03 2,022 02 5
2,022 04 2,022 04 10
2,022 05 2,022 05 7
2,022 06 2,022 05 7
2,022 07 2,022 05 7
2,022 08 2,022 05 7
2,022 09 2,022 09 1
*/
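Outside the graphical-view constraints, the carry-forward ("last observation carried forward") step itself is easy to sanity-check. A hedged awk sketch over a hypothetical tab-separated (YR, PERIOD, QTY) extract of the outer join, where an empty QTY inherits the last seen value:

```shell
printf '2022\t01\t10\n2022\t02\t5\n2022\t03\t\n2022\t04\t10\n2022\t05\t7\n2022\t06\t\n' |
awk -F'\t' 'BEGIN { OFS = FS }
  { if ($3 == "") $3 = prev   # gap: fill with last non-empty quantity
    else prev = $3            # actual record: remember its quantity
    print }'
# 2022  01  10
# 2022  02  5
# 2022  03  5
# 2022  04  10
# 2022  05  7
# 2022  06  7
```

This is only a reference for the intended semantics; the graphical calc. view below has to reach the same result with joins, a rank, and a filter.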
So far, so good.
The graphical calc. view for that looks like this:
As it's cumbersome to screenshoot every single node, I will include the just most important ones:
1. CAL_DAT_PREV
Since only equality joins are supported in graphical calc. views, we have to emulate the "larger than" join. For that, I created two calculated/constant columns, join_const, with the same value (integer 1 in this case) and joined on those.
2. PREVS_ARE_OLDER
This is the second part of the emulated "larger than" join: this projection keeps only the records where cal_yr_per is larger than or equal to prev_yr_per. Equal values must be allowed here, since we don't want to lose records for which there is no smaller YEAR||PERIOD combination. Alternatively, one could insert an initial record into the DATA table that is guaranteed to be smaller than all other entries, e.g. YEAR = 0001 and PERIOD = 00 or something similar. If you're familiar with SAP application tables, you've seen this approach.
By the way - for convenience reasons, I created calculated columns that combine the YEAR and PERIOD for the different tables - cal_yr_per, dat_yr_per, and prev_yr_per.
3. RANK_1
Here the rank is computed over PREV_YR_PER, picking the first one only and starting a new group for every new value of cal_yr_per.
This value is returned via Rank_Column.
4. REDUCE_PREV
The final piece of the puzzle: using a filter on Rank_Column = 1 we ensure to only get one "previous" row for every "calendar" row.
Also: by means of IF(ISNULL(...), ... , ...) we emulate COALESCE(...) in three calculated columns, aptly named FILL....
And that's the nuts and bolts of this solution.
"It works on my computer!" is probably the best I can say about it.
SELECT "CAL_YR", "CAL_PERIOD"
, "DAT_YR", "DAT_PER", "DAT_QTY"
, "FILL_YR", "FILL_QTY", "FILL_PER"
FROM "_SYS_BIC"."scratch/QTY_FILLUP"
ORDER BY "CAL_YR" asc, "CAL_PERIOD" asc;
/*
CAL_YR CAL_PERIOD DAT_YR DAT_PER DAT_QTY FILL_YR FILL_QTY FILL_PER
2,022 01 2,022 01 10 2,022 10 01
2,022 02 2,022 02 5 2,022 5 02
2,022 03 ? ? ? 2,022 5 02
2,022 04 2,022 04 10 2,022 10 04
2,022 05 2,022 05 7 2,022 7 05
2,022 06 ? ? ? 2,022 7 05
2,022 07 ? ? ? 2,022 7 05
2,022 08 ? ? ? 2,022 7 05
2,022 09 2,022 09 1 2,022 1 09
2,022 10 ? ? ? 2,022 1 09
2,022 11 ? ? ? 2,022 1 09
2,022 12 ? ? ? 2,022 1 09
*/

Adding parent to child document

I am using a database with two tables:
HEADING: (ID, CAPTION, CUSTOMER_ID, DATE, TDOC, ID_PAR) -- ID is PK; ID_PAR is FK related to HEADING.ID
LINES: (ID, NDOC, QTE) -- ID is PK; NDOC is FK related to HEADING.ID
HEADING is queried recursively to get all children of a given HEADING ID,
for example an invoice with many delivery notes.
In that system the lines of an invoice are not typed; we type only delivery notes.
How do I get all the lines of an invoice when the datasets are related as master-detail?
Heading dataset
ID PARENT_ID DOC_DATE DOC_TYPE
01 NULL A
02 01 B
03 02 C
04 02 C
Lines Dataset
ID_PK NDOC_FK CODE_PROD QTE
01 03 P1 5
02 03 P10 20
03 03 P67 65
04 04 P61 34
SQL LINES RESULT for heading.ID=01
PRODUCT QTE
P1 5
P10 20
P67 65
P61 34
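The question does not show its recursive query. As a hedged sketch (the DBMS isn't named, so sqlite3 is used purely for illustration, with table and column names taken from the question), one could walk HEADING down from the invoice ID and join LINES:

```shell
sqlite3 :memory: <<'SQL'
CREATE TABLE heading (id TEXT PRIMARY KEY, id_par TEXT);
CREATE TABLE lines   (id TEXT PRIMARY KEY, ndoc TEXT, code_prod TEXT, qte INTEGER);
INSERT INTO heading VALUES ('01', NULL), ('02', '01'), ('03', '02'), ('04', '02');
INSERT INTO lines   VALUES ('01','03','P1',5), ('02','03','P10',20),
                           ('03','03','P67',65), ('04','04','P61',34);
-- Collect heading 01 and all of its descendants,
-- then return every line attached to any of them.
WITH RECURSIVE tree(id) AS (
  SELECT '01'
  UNION ALL
  SELECT h.id FROM heading h JOIN tree t ON h.id_par = t.id
)
SELECT l.code_prod AS product, l.qte
FROM lines l JOIN tree ON l.ndoc = tree.id
ORDER BY l.id;
SQL
# prints:
# P1|5
# P10|20
# P67|65
# P61|34
```

The recursive CTE syntax carries over to most databases that support WITH RECURSIVE; vendor-specific variants (e.g. Oracle's CONNECT BY) would need adapting.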

find the difference in substring of timestamp in awk

I am trying to append some text to /var/log/messages output whenever the timestamps of two consecutive log lines fall in different 10-second buckets, for example:
previous log 00:01:59 and current log 00:02:00
or
previous log 00:01:49 and current log 00:01:50.
When that substring of the timestamp differs between consecutive lines, append a message to $0.
You may run the command below; it works at 1-minute boundaries, but I need it at 10-second boundaries.
tail -f /var/log/messages |awk '{split($3,a,":");split($3,b,"");new_time=a[1]":"a[2]":"b[1]; if(prev_time==new_time) print $0; else print "10 Second group is over, starting new: "$0" "prev_time " "new_time } {split($3,a,":");split($3,b,"");prev_time=a[1]":"a[2]":"b[1]}'
The required result is a modification of the above command that prints the same message at every 10-second gap in the logs; currently it does so once per minute. I used split() to capture "HH:MM:S" rather than "HH:MM:SS", so that whenever the previous "HH:MM:S" and the current "HH:MM:S" differ, the message "10 Second group is over, starting new: $0" is printed. I'm not sure what the mistake is.
In short, it currently triggers when the minute changes; I need it to trigger when the tens-of-seconds digit changes, e.g. from :39 to :40 or from :09 to :10, but NOT from :11 to :12.
Sample lines:
Jan 23 15:09:54 foo bar
Jan 23 15:10:04 bla bla
this is the general idea:
$ for((i=35;i<45;i++)); do echo Jan 23 00:01:$i; done |
awk '{split($3,a,":"); print $0, (p!=(c=int(a[3]/10))?"<<<":""); p=c}'
Jan 23 00:01:35 <<<
Jan 23 00:01:36
Jan 23 00:01:37
Jan 23 00:01:38
Jan 23 00:01:39
Jan 23 00:01:40 <<<
Jan 23 00:01:41
Jan 23 00:01:42
Jan 23 00:01:43
Jan 23 00:01:44
The first part is test data generated for the script, since you didn't provide enough. There is a spurious match on the first line, which can be eliminated with an NR>1 condition, but I don't think that's critical.
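Applied to the question's own intent, the same idea can also be written with substr(): the first seven characters of HH:MM:SS give HH:MM:S, i.e. the tens-of-seconds bucket, and the message is emitted whenever that bucket changes (sample input invented for the demo):

```shell
printf 'Jan 23 00:01:39 foo\nJan 23 00:01:40 bar\nJan 23 00:01:41 baz\n' |
awk '{ c = substr($3, 1, 7)                 # e.g. "00:01:4" — HH:MM plus tens digit
       if (NR > 1 && c != p)
         print "10 Second group is over, starting new: " $0
       else
         print
       p = c }'
# Jan 23 00:01:39 foo
# 10 Second group is over, starting new: Jan 23 00:01:40 bar
# Jan 23 00:01:41 baz
```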

Count matching pairs

This is a little complicated, so bear with me. Below is a table named "list" with 2 columns, recording each member and the films they like; i.e. member 01 likes films 02, 05, 14, 21 and 25. I want to find out how many films each member has in common with every other member. For example, member 01 and member 02 have one film in common (film 14). Is there any way to write this in SQL?
List
------ ------
member film
------ ------
01 02
01 05
01 14
01 21
01 25
02 03
02 09
02 14
03 01
03 05
03 17
03 21
You can write a general query for this using a self-join and aggregation. The following summarizes the results for each pair of users:
select l1.member, l2.member, count(*) as NumFilmsInCommon
from list l1 join
list l2
on l1.member < l2.member and l1.film = l2.film
group by l1.member, l2.member;
The < condition just ensures that each pair of members only appears once in the result set.
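For a quick check, running that exact query against the sample data (sqlite3 used purely for illustration) confirms the expected counts: members 01 and 02 share one film (14), and members 01 and 03 share two (05 and 21); 02 and 03 share none, so no row appears for that pair.

```shell
sqlite3 :memory: <<'SQL'
CREATE TABLE list (member TEXT, film TEXT);
INSERT INTO list VALUES
  ('01','02'),('01','05'),('01','14'),('01','21'),('01','25'),
  ('02','03'),('02','09'),('02','14'),
  ('03','01'),('03','05'),('03','17'),('03','21');
-- Self-join on film, keeping each pair once via member < member
SELECT l1.member, l2.member, COUNT(*) AS NumFilmsInCommon
FROM list l1 JOIN list l2
  ON l1.member < l2.member AND l1.film = l2.film
GROUP BY l1.member, l2.member
ORDER BY l1.member, l2.member;
SQL
# prints:
# 01|02|1
# 01|03|2
```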

SQL Server 2008 Express R2 Format query output to textfile

I have a large table (~6 GB), bigger than my computer's RAM, that I'm trying to output from a SQL Server database to a text file in a specific format, to be used as input for another program.
The database looks something like this:
bought month day
0 11 01
0 10 07
1 01 06
1 04 03
0 08 03
0 07 03
0 02 03
0 11 05
1 09 02
... ... ...
I need the output file to be formatted like this:
0 | month:11 day:01
0 | month:10 day:07
1 | month:01 day:06
1 | month:04 day:03
0 | month:08 day:03
0 | month:07 day:03
0 | month:02 day:03
0 | month:11 day:05
1 | month:09 day:02
...
Any suggestions for how to do this?
If you're running SQL Server Management Studio, you can output to a file.
Go to Tools -> Options -> Query Results and you should be able to choose "Results to file"; specify the directory underneath, and run your query.
Should just need a bit of string concatenation. Note that on SQL Server 2008 any numeric columns must be CAST to varchar before concatenating with + (CONCAT() only arrived in SQL Server 2012); CAST month and day too if they are numeric rather than char:
SELECT CAST(bought AS varchar(10)) + ' | month:' + month + ' day:' + day
FROM tablename
I must confess, I've never tried this on such a large dataset so I can't tell you how it will perform, however once you've entered the query, you can tell SSMS to send the output to a text file as per Bridge's answer.
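If exporting the raw columns first (e.g. with the bcp utility or Results-to-file) is an option, the formatting step itself streams line by line and is unaffected by table size; a sketch over a hypothetical tab-separated dump (sample rows invented):

```shell
printf '0\t11\t01\n1\t01\t06\n' |
awk -F'\t' '{ print $1 " | month:" $2 " day:" $3 }'
# 0 | month:11 day:01
# 1 | month:01 day:06
```

For the real 6 GB table you would replace the printf with the exported file, keeping memory use constant.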