AWK - removal of the same fields on the basis of the "$1"

I have a file1:
6
3
6
9
2
6
This command prints the result:
awk 'NR==1{a=$1};$0!=a' file1
3
9
2
Now I have file2:
6 1 2 3 4 5
3 3 4 4 4 6
6 5 2 2 5 1
9 1 3 5 4 1
2 5 6 4 8 5
6 1 5 2 3 1
I want to do the same thing, but with file2. I want to print out the result:
3 3 4 4 5 6
9 5 3 2 8 1
2 5 6 5 3 1
5 4 1
2
I want to do it in awk. Thank you for your help.

AWK is not really suited for what you are trying to do, since it is made for processing rows one at a time, while you are trying to shift numbers up and down between different rows. That said, this monster should do what you want:
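awk '
NR==1 { nc=NF; for(i=1;i<=nc;i++) a[i]=$i }   # remember the first row
{
  for(i=1;i<=nc;i++)
    if($i!=a[i]) {               # keep only values that differ from row 1
      v[m[i]++,i]=$i             # stack them per column
      if(m[i]>nl) nl=m[i]        # track the tallest column
    }
}
END {
  for(l=0;l<nl;l++) {
    for(i=1;i<=nc;i++) {
      if(l<m[i]) { printf("%d ", v[l,i]) } else { printf(" ") }
    }
    printf("\n")
  }
}' file2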
If, on the other hand, your matrix of numbers had been transposed, this task would have been far simpler:
awk '{for(i=2;i<=NF;i++)if($i!=$1)printf(" %d",$i);printf("\n")}'
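As a cross-check of the column-wise logic, here is a minimal Python sketch (not awk, and not part of the original answer). The function name drop_first_row_values is hypothetical, and None is used to mark the gaps that the awk version prints as blanks.

```python
def drop_first_row_values(rows):
    """For each column, drop every value equal to that column's first-row
    value, then shift the survivors up. Gaps are filled with None."""
    first = rows[0]
    columns = [
        [row[i] for row in rows if row[i] != first[i]]
        for i in range(len(first))
    ]
    height = max(len(col) for col in columns)
    return [
        [col[level] if level < len(col) else None for col in columns]
        for level in range(height)
    ]


# file2 from the question
rows = [
    [6, 1, 2, 3, 4, 5],
    [3, 3, 4, 4, 4, 6],
    [6, 5, 2, 2, 5, 1],
    [9, 1, 3, 5, 4, 1],
    [2, 5, 6, 4, 8, 5],
    [6, 1, 5, 2, 3, 1],
]
result = drop_first_row_values(rows)
for line in result:
    print(" ".join("" if value is None else str(value) for value in line))
```

The first three printed rows are complete; the last two contain blanks where a column has run out of values, which is why the desired output in the question looks ragged.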

Related

If a column value does not have a certain number of occurrences in a dataframe, how to duplicate rows at random until that count is met?

Say that this is what my dataframe looks like
A B
0 1 5
1 4 2
2 3 5
3 3 3
4 3 2
5 2 0
6 4 5
7 2 3
8 4 1
9 5 1
I want every unique value in column B to occur at least 3 times. So none of the rows with a B value of 5 are duplicated. The row with a B value of 0 is duplicated twice. And each of the remaining values has one of its two rows duplicated at random.
Here is an example desired output
A B
0 1 5
1 4 2
2 3 5
3 3 3
4 3 2
5 2 0
6 4 5
7 2 3
8 4 1
9 5 1
10 4 2
11 2 3
12 2 0
13 2 0
14 4 1
Edit:
The row chosen to be duplicated should be selected at random
To pick rows at random, I would use groupby and apply with sample on each group. The x in the lambda is each group of B, so repeats - x.shape[0] gives the number of rows that need to be created. Some groups of B may already have more than 3 rows, so I use np.clip to force negative values to 0. Sampling 0 rows is the same as ignoring the group. Finally, reset_index and append the result back to df.
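import numpy as np
import pandas as pd

repeats = 3
# per group of B, sample (with replacement) only the rows still missing
df1 = (df.groupby('B')
         .apply(lambda x: x.sample(n=int(np.clip(repeats - x.shape[0], 0, np.inf)), replace=True))
         .reset_index(drop=True))
# DataFrame.append was removed in pandas 2.0; pd.concat does the same job
df_final = pd.concat([df, df1]).reset_index(drop=True)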
Out[43]:
A B
0 1 5
1 4 2
2 3 5
3 3 3
4 3 2
5 2 0
6 4 5
7 2 3
8 4 1
9 5 1
10 2 0
11 2 0
12 5 1
13 4 2
14 2 3
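For anyone who wants to try this end to end, here is a minimal self-contained sketch. It swaps groupby.apply for an explicit loop over the groups (a variation, not the code from the answer), and the fixed random_state is only an assumption to make the run repeatable.

```python
import pandas as pd

# Dataframe from the question
df = pd.DataFrame({
    "A": [1, 4, 3, 3, 3, 2, 4, 2, 4, 5],
    "B": [5, 2, 5, 3, 2, 0, 5, 3, 1, 1],
})
repeats = 3

extra = []
for value, group in df.groupby("B"):
    # how many rows this B value still needs to reach the minimum
    missing = max(repeats - len(group), 0)
    if missing:
        # draw the missing rows at random, with replacement
        extra.append(group.sample(n=missing, replace=True, random_state=0))

df_final = pd.concat([df] + extra, ignore_index=True)
counts = df_final["B"].value_counts()
print(df_final)
```

Which rows get duplicated depends on the random state, but every value of B ends up with at least 3 rows and the frame grows from 10 to 15 rows.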

Comparing two dataframe and output the index of the duplicated row once

I need help with comparing two dataframes. For example:
The first dataframe is
df_1 =
0 1 2 3 4 5
0 1 1 1 1 1 1
1 2 2 2 2 2 2
2 3 3 3 3 3 3
3 4 4 4 4 4 4
4 2 2 2 2 2 2
5 5 5 5 5 5 5
6 1 1 1 1 1 1
7 6 6 6 6 6 6
The second dataframe is
df_2 =
0 1 2 3 4 5
0 1 1 1 1 1 1
1 2 2 2 2 2 2
2 3 3 3 3 3 3
3 4 4 4 4 4 4
4 5 5 5 5 5 5
5 6 6 6 6 6 6
Is there a way (without using a for loop) to find the indices of the rows of df_1 that have the same values as the rows of df_2? In the example above, my expected output is below
index =
0
1
2
3
5
7
The "index" variable above should have as many rows as df_2.
If a row of df_2 appears in df_1 more than once, I only need the index of its first appearance, which is why I don't need the indices 4 and 6.
Please help. Thank you so much!
Tommy
Use DataFrame.merge together with DataFrame.drop_duplicates and DataFrame.reset_index, which converts the index to a column so the original index values are not lost; finally, select the column called index:
s = df_2.merge(df_1.drop_duplicates().reset_index())['index']
print (s)
0 0
1 1
2 2
3 3
4 5
5 7
Name: index, dtype: int64
Detail:
print (df_2.merge(df_1.drop_duplicates().reset_index()))
0 1 2 3 4 5 index
0 1 1 1 1 1 1 0
1 2 2 2 2 2 2 1
2 3 3 3 3 3 3 2
3 4 4 4 4 4 4 3
4 5 5 5 5 5 5 5
5 6 6 6 6 6 6 7
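A self-contained version of the merge approach, as a sketch for experimenting. The string column names "0" to "5" are an assumption made here to keep the labels uniform once reset_index adds the "index" column.

```python
import pandas as pd

values_1 = [1, 2, 3, 4, 2, 5, 1, 6]
values_2 = [1, 2, 3, 4, 5, 6]
# every column holds the same values, as in the question
df_1 = pd.DataFrame({str(c): values_1 for c in range(6)})
df_2 = pd.DataFrame({str(c): values_2 for c in range(6)})

# keep the first occurrence of each row and expose its index as a column
first_seen = df_1.drop_duplicates().reset_index()
# inner merge on all shared columns, in the row order of df_2
result = df_2.merge(first_seen)["index"].tolist()
print(result)  # [0, 1, 2, 3, 5, 7]
```

Because drop_duplicates keeps the first occurrence by default, the repeated rows at index 4 and 6 never reach the merge.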
Check the solution
df1=pd.DataFrame({'0':[1,2,3,4,2,5,1,6],
'1':[1,2,3,4,2,5,1,6],
'2':[1,2,3,4,2,5,1,6],
'3':[1,2,3,4,2,5,1,6],
'4':[1,2,3,4,2,5,1,6],
'5':[1,2,3,4,2,5,1,6]})
df2=pd.DataFrame({'0':[1,2,3,4,5,6],
'1':[1,2,3,4,5,66],
'2':[1,2,3,4,5,6],
'3':[1,2,3,4,5,66],
'4':[1,2,3,4,5,6],
'5':[1,2,3,4,5,6]})
df1[df1.isin(df2)].index.values.tolist()
### Output
[0, 1, 2, 3, 4, 5, 6, 7]

Variation with reputation for rows of file

I have a file
4 5 6 6
1 7 5 5
7 0 2 1
7 8 0 6
and I would like to produce files that have random chosen rows from this file with reputation. So the outputs can be for instance:
4 5 6 6
1 7 5 5
1 7 5 5
7 8 0 6
7 8 0 6
1 7 5 5
1 7 5 5
7 8 0 6
I mean that some of the rows will be in the output more times, some of the rows zero times. Is it possible to produce a list of random numbers with reputation and according to it choose rows from input? Is it possible in awk or is some other language more appropriate?
If this isn't all you need:
$ shuf -n $(wc -l < file) -r file
4 5 6 6
7 8 0 6
1 7 5 5
1 7 5 5
then edit your question to clarify your requirements.
Don't understand the concept of reputation but here is a way using just randomness:
$ awk -v seed=$RANDOM '{ # set the random seed externally
a[NR]=$0 # hash records to a
}
END {
srand(seed)
for(i=1;i<=4;i++) # 4 is the number of records to output
print a[int(1+rand()*NR)] # get a random array element and output it
}' file
An example of output:
7 8 0 6
7 8 0 6
7 8 0 6
1 7 5 5
You could also do this with coreutils shuf and sed, e.g.:
n=$(wc -l < infile)
shuf -n $n -i 1-$n -r | sed 's/$/p/' | sed -nf - infile
Output example:
4 5 6 6
4 5 6 6
1 7 5 5
1 7 5 5
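If another language is acceptable, the same idea (draw row numbers at random, allowing repeats) can be sketched in Python with random.choices from the standard library. The function name and the seed parameter are hypothetical, chosen here for illustration.

```python
import random


def sample_rows_with_repetition(lines, k=None, seed=None):
    """Return k lines drawn uniformly at random, with replacement.

    k defaults to the number of input lines; seed makes a run repeatable.
    """
    rng = random.Random(seed)
    if k is None:
        k = len(lines)
    return rng.choices(lines, k=k)


# rows of the file from the question
rows = ["4 5 6 6", "1 7 5 5", "7 0 2 1", "7 8 0 6"]
picked = sample_rows_with_repetition(rows, seed=1)
print("\n".join(picked))
```

Some rows may appear several times and others not at all, which matches the sample outputs in the question.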

pandas drop duplicate row value from a specific column

I want to remove the duplicate row value from a specific column - in this case the column name is "number".
Before:
number qty status
0 10 2 go
1 10 5 nogo
2 4 6 yes
3 3 1 no
4 2 7 go
5 5 2 nah
6 5 6 go
7 5 3 nogo
8 1 10 yes
9 1 10 go
10 5 2 nah
After:
number qty status
0 10 2 go
5 nogo
1 4 6 yes
2 3 1 no
3 2 7 go
4 5 2 nah
6 go
3 nogo
5 1 10 yes
10 go
6 5 2 nah
You can replace the values with empty strings or NaNs using mask with duplicated on a new Series a, created by comparing the column with its shifted version and taking the cumsum:
a = df['number'].ne(df['number'].shift()).cumsum()
#for replace ''
df['number'] = df['number'].mask(a.duplicated(), '')
#for replace NaNs
#df['number'] = df['number'].mask(a.duplicated())
print (df)
number qty status
0 10 2 go
1 5 nogo
2 4 6 yes
3 3 1 no
4 2 7 go
5 5 2 nah
6 6 go
7 3 nogo
8 1 10 yes
9 10 go
10 5 2 nah
Detail:
a = df['number'].ne(df['number'].shift()).cumsum()
print (a)
0 1
1 1
2 2
3 3
4 4
5 5
6 5
7 5
8 6
9 6
10 7
Name: number, dtype: int32
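The run-based grouping is what keeps the 5 at index 10, even though 5 already appeared earlier. A minimal runnable sketch, assuming the data from the question; the cast to object before masking is an addition made here so that integers and empty strings can share one column on any pandas version.

```python
import pandas as pd

df = pd.DataFrame({
    "number": [10, 10, 4, 3, 2, 5, 5, 5, 1, 1, 5],
    "qty": [2, 5, 6, 1, 7, 2, 6, 3, 10, 10, 2],
    "status": ["go", "nogo", "yes", "no", "go", "nah",
               "go", "nogo", "yes", "go", "nah"],
})

# a new run starts whenever the value differs from the previous row
run_id = df["number"].ne(df["number"].shift()).cumsum()
# within each run, every row after the first is a duplicate
is_repeat = run_id.duplicated()
blanked = df["number"].astype(object).mask(is_repeat, "")
print(blanked.tolist())  # [10, '', 4, 3, 2, 5, '', '', 1, '', 5]
```

Calling duplicated directly on df["number"] would also blank the last row, since it looks at the whole column rather than at consecutive runs.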

CPlex coding logic

The professor in charge of an industrial engineering design course is faced with the problem of assigning 28 students to 8 projects. Each student must be assigned to one project and each project group must have 3 or 4 students. The students have been asked to rank the projects, with 1 being the best ranking and higher numbers representing lower rankings.
a) Formulate an OPL model for this problem.
b) Solve the assignment problem for the following table of assignments:
A ED EZ G H1 H2 RB SC
Allen 1 3 4 7 7 5 2 6
Black 6 4 2 5 5 7 1 3
Chung 6 2 3 1 1 7 5 4
Clark 7 6 1 2 2 3 5 4
Conners 7 6 1 3 3 4 5 2
Cumming 6 7 4 2 2 3 5 1
Demming 2 5 4 6 6 1 3 7
Eng 4 7 2 1 1 6 3 5
Farmer 7 6 5 2 2 1 3 4
Forest 6 7 2 5 5 1 3 4
Goodman 7 6 2 4 4 5 1 3
Harris 4 7 5 3 3 1 2 6
Holmes 6 7 4 2 2 3 5 1
Johnson 2 2 4 6 6 5 3 1
Knorr 7 4 1 2 2 5 6 3
Manheim 4 7 2 1 1 3 6 5
Morris 7 5 4 6 6 3 1 2
Nathan 4 7 5 6 6 3 1 2
Neuman 7 5 4 6 6 3 1 2
Patrick 1 7 5 4 4 2 3 6
Rollins 6 2 3 1 1 7 5 4
Schuman 4 7 3 5 5 1 2 6
Silver 4 7 3 1 1 2 5 6
Stein 6 4 2 5 5 7 1 3
Stock 5 2 1 6 6 7 4 3
Truman 6 3 2 7 7 5 1 4
Wolman 6 7 4 2 2 3 5 1
Young 1 3 4 7 7 6 2 5
How many students are assigned their second or third choice?
c) Some of the projects are harder than others to reach without a car. Thus, it is desirable that at least a certain number of students assigned to each project must have a car; the numbers vary by project as follows:
A ED EZ G H1 H2 RB SC
1 0 0 2 2 2 1 1
The students who have cars are Chung, Demming, Eng, Holmes, Manheim, Morris, Nathan, Patrick, Rollins and Young.
Modify the model to add this car constraint and solve the problem again. How many more students than before must be assigned second or third choices?
I coded the model for a) and b), but I am stuck at c).
Can anyone help with the logic, please? Even AMPL will suffice.
Let C_i be the indicator vector (input): C_i = 1 if student i has a car and 0 otherwise. I'll assume you have the following decision variables:
x_ij = 1 if student i is assigned to project j; 0 otherwise
then the constraint in c) can be modeled as follows
sum_i C_i * x_ij >= b_j for all j
where b_j is
j A ED EZ G H1 H2 RB SC
b_j 1 0 0 2 2 2 1 1
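To sanity-check a solution against this constraint outside the solver, a small Python sketch can evaluate sum_i C_i * x_ij for each project. The function name car_shortfalls and the partial assignment below are hypothetical, for illustration only; the car owners and the b_j values are taken from the question.

```python
def car_shortfalls(assignment, has_car, required):
    """Return {project: missing cars} for every project j where
    sum_i C_i * x_ij < b_j. An empty dict means the constraint holds."""
    cars = {project: 0 for project in required}
    for student, project in assignment.items():
        if student in has_car:  # C_i = 1
            cars[project] += 1
    return {
        project: required[project] - cars[project]
        for project in required
        if cars[project] < required[project]
    }


has_car = {"Chung", "Demming", "Eng", "Holmes", "Manheim",
           "Morris", "Nathan", "Patrick", "Rollins", "Young"}
required = {"A": 1, "ED": 0, "EZ": 0, "G": 2,
            "H1": 2, "H2": 2, "RB": 1, "SC": 1}

# hypothetical partial assignment, NOT a solution of the model
assignment = {"Patrick": "A", "Young": "A", "Chung": "G",
              "Eng": "G", "Allen": "RB"}
shortfalls = car_shortfalls(assignment, has_car, required)
print(shortfalls)
```

Feeding in the assignment from b) shows which projects force a change once the car requirement is added.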