Pad column with n zeros and trim excess values - awk

For example, the original data file
file.org:
1 2 3 4 5
6 7 8 9 0
11 12 13 14 15
16 17 18 19 20
21 22 23 24 25
Insert three zeros (0) at the top of column 2, shifting the existing values down and trimming the excess off the bottom.
The output file should look like this,
file.out:
1 0 3 4 5
6 0 8 9 0
11 0 13 14 15
16 2 18 19 20
21 7 23 24 25
Please help.

The following awk will do the trick; it remembers each column-2 value and replaces the current one with the value from n rows earlier (empty, hence 0, for the first n rows):
awk -v n=3 '{a[NR]=$2; $2=a[NR-n]+0}1' file

A variant that keeps only the last n values in a circular buffer (indexed by NR%n) instead of the whole column:
$ awk -v n=3 '{x=$2; $2=a[NR%n]+0; a[NR%n]=x} 1' file
1 0 3 4 5
6 0 8 9 0
11 0 13 14 15
16 2 18 19 20
21 7 23 24 25
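For illustration, here is a minimal Python sketch of the same delay-line idea (hypothetical, not from the original answers): pad the top of column 2 with n zeros and drop the same number of values off the bottom.
# a sketch assuming a whitespace-separated file named "file.org" as above
n = 3
with open("file.org") as fh:
    rows = [line.split() for line in fh]

col2 = [row[1] for row in rows]      # original column 2: 2 7 12 17 22
shifted = ["0"] * n + col2[:-n]      # n zeros on top, excess trimmed below
for row, value in zip(rows, shifted):
    row[1] = value
    print(" ".join(row))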

If you want to try Perl: the @t array acts as a queue, with three zeros pre-loaded; each row pushes its column-2 value on the back and shifts the front off into column 2.
$ cat file.orig
1 2 3 4 5
6 7 8 9 0
11 12 13 14 15
16 17 18 19 20
21 22 23 24 25
$ perl -lane ' BEGIN { push(@t,0,0,0) } push(@t,$F[1]);$F[1]=shift @t; print join(" ",@F) ' file.orig
1 0 3 4 5
6 0 8 9 0
11 0 13 14 15
16 2 18 19 20
21 7 23 24 25
$

EDIT: Since the OP has edited the question, here is a solution for the new version.
awk -v count=3 '++val<=count{a[val]=$2;$2=0} val>count{if(++re_count<=count){$2=a[re_count]}} 1' Input_file
This stashes the first count column-2 values in a[], zeroes them out, and writes them back into column 2 of the following rows. Output will be as follows.
1 0 3 4 5
6 0 8 9 0
11 0 13 14 15
16 2 18 19 20
21 7 23 24 25
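A hypothetical Python rendering of the same stash-and-restore logic (a sketch, not from the original answer; the file name follows the awk command):
count = 3
stash = []
with open("Input_file") as fh:
    for i, line in enumerate(fh):
        fields = line.split()
        if i < count:
            stash.append(fields[1])       # remember the first count values
            fields[1] = "0"               # and zero them out
        elif i - count < len(stash):
            fields[1] = stash[i - count]  # restore them into the next rows
        print(" ".join(fields))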
Could you please try the following. It zeroes out column 2 wherever it is nonzero, collects the original values, and after the table prints zeros to pad up to count rows, followed by the collected values:
awk -v count=5 '
BEGIN{
OFS="\t"
}
$2{
val=(val?val ORS OFS:OFS)$2
$2=0
occ++
$1=$1
}
1
END{
while(++occ<=count){
print OFS 0
}
print val
}' Input_file
Output will be as follows.
1 0 3 4 5
6 0 8 9 0
11 0 13 14 15
0
0
2
7
12

ValueError: Data must be 1-dimensional

Hello,
I don't understand why this issue occurs.
print("p.shape= ", p.shape)
print("dfmj_dates['deces'].shape = ",dfmj_dates['deces'].shape)
cross_dfmj = pd.crosstab(p, dfmj_dates['deces'])
That produces:
p.shape= (683, 1)
dfmj_dates['deces'].shape = (683,)
----> 3 cross_dfmj = pd.crosstab(p, dfmj_dates['deces'])
--> 654 df = DataFrame(data, index=common_idx)
--> 614 mgr = dict_to_mgr(data, index, columns, dtype=dtype, copy=copy, typ=manager)
--> 589 val = sanitize_array(
--> 576 subarr = _sanitize_ndim(subarr, data, dtype, index, allow_2d=allow_2d)
--> 627 raise ValueError("Data must be 1-dimensional")
ValueError: Data must be 1-dimensional
I suspect the issue comes from the difference between (683, 1)
and (683,). I tried something like p.flatten(order = 'C') to get
(683,), and also pd.DataFrame(dfmj_dates['deces']), but that failed.
Do you have any idea? Regards, Atapalou
print(p.head(30))
print(df.head(30))
That produces:
week
0 8
1 8
2 8
3 9
4 9
5 9
6 9
7 9
8 9
9 9
10 10
11 10
12 10
13 10
14 10
15 10
16 10
17 11
18 11
19 11
20 11
21 11
22 11
23 11
24 12
25 12
26 12
27 12
28 12
29 12
deces
0 0
1 1
2 0
3 0
4 0
5 1
6 0
7 0
8 0
9 0
10 1
11 1
12 0
13 3
14 4
15 5
16 3
17 11
18 3
19 15
20 13
21 18
22 12
23 36
24 21
25 27
26 69
27 128
28 78
29 112
pd.crosstab expects 1-dimensional inputs, but p has shape (683, 1), i.e. it is 2-dimensional. Try squeezing p:
cross_dfmj = pd.crosstab(p.squeeze(), dfmj_dates['deces'])
Example:
p = np.random.random((5, 1))
p.shape
# (5, 1)
p.squeeze().shape
# (5,)
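For completeness, a minimal self-contained sketch of the fix (with made-up stand-in data, since dfmj_dates is not reproduced here):
import pandas as pd

# stand-in data: p is a one-column DataFrame, deces a Series
p = pd.DataFrame({"week": [8, 8, 9, 9, 10, 10]})     # shape (6, 1)
deces = pd.Series([0, 1, 0, 0, 1, 1], name="deces")  # shape (6,)

# pd.crosstab needs 1-dimensional inputs; squeeze() turns the
# one-column DataFrame into a Series, i.e. (6, 1) -> (6,)
cross = pd.crosstab(p.squeeze(), deces)
print(cross)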

Unpivot a data-frame that has information about two teams in one row?

I have some data that holds information about two opposing teams
home_x away_x
0 7 28
1 11 10
2 11 20
3 12 15
4 12 16
I know about .melt(), which returns something like this:
variable value
0 home_x 7
1 home_x 11
2 home_x 11
3 home_x 12
4 home_x 12
So each value is a row here.
There are several attributes for each team.
I want each row to have all the attributes for the respective team (home or away).
The ultimate goal is to have all the attributes of both teams in one row. This would double the number of rows.
home_x away_x
0 7 28
would be transformed into:
team1_x team2_x
0 7 28
0 28 7
sample df:
   home_x  away_x  home_y  away_y
0       7      28       7      20
1      28       7      28      13
2      28       7      28       4
3       7      28       7      58
4      11      10      11      10
try:
res = pd.DataFrame()
for c in df.columns.str.split("_").str[1].unique():
    p1 = df.filter(regex=f"{c}$")
    c1, c2 = p1.columns
    df_map = {c1: c2, c2: c1}          # swap home/away column names
    swap = p1.rename(columns=df_map)
    # interleave original and swapped rows; DataFrame.append was removed
    # in pandas 2.0, so pd.concat is used instead
    res = pd.concat([res, pd.concat([p1, swap]).sort_index(ignore_index=True)], axis=1)
then rename the columns.
import re
repl = {'home': 'team1', 'away': 'team2'}
res.columns = [re.sub('|'.join(repl.keys()), lambda x: repl[x.group()], i) for i in res.columns]
   team1_x  team2_x  team1_y  team2_y
0        7       28        7       20
1       28        7       20        7
2       28        7       28       13
3        7       28       13       28
4       28        7       28        4
5        7       28        4       28
6        7       28        7       58
7       28        7       58        7
8       11       10       11       10
9       10       11       10       11
Here is an approach:
Group the columns on the last split of their names (with axis=1), then for each group stack the original and the column-reversed copy, set both to the same team1/team2 labels, and concatenate the groups back with the group name as a suffix:
def myinfo(data):
    c = data.columns.str.split("_").str[-1]
    f = lambda x: pd.DataFrame.set_axis(x, ["team1", "team2"], axis=1)
    l = [pd.concat([*map(f, (v, v.iloc[:, ::-1]))]).add_suffix(f"_{k}")
         for k, v in data.groupby(c, axis=1)]
    return pd.concat(l, axis=1).sort_index()
print(myinfo(df))
team1_x team2_x
0 7 28
0 28 7
1 11 10
1 10 11
2 11 20
2 20 11
3 12 15
3 15 12
4 12 16
4 16 12

Keep only the first value on duplicated column (set 0 to others)

Supposing I have the following situation:
A dataframe where the first column ['ID'] will eventually have duplicated values.
import pandas as pd
df = pd.DataFrame({"ID": [1,2,3,4,4,5,5,5,6,6],
                   "l_1": [10,12,32,45,45,20,20,20,20,20],
                   "l_2": [11,12,32,11,21,27,38,12,9,6],
                   "l_3": [5,9,32,12,21,21,18,12,8,1],
                   "l_4": [6,21,12,77,77,2,2,2,8,8]})
ID l_1 l_2 l_3 l_4
1 10 11 5 6
2 12 12 9 21
3 32 32 32 12
4 45 11 12 77
4 45 21 21 77
5 20 27 21 2
5 20 38 18 2
5 20 12 12 2
6 20 9 8 8
6 20 6 1 8
When duplicated IDs occur:
I need to keep only the first value in columns l_1 and l_4 (the other duplicated rows must be zero).
Columns 'l_2' and 'l_3' must stay the same.
When IDs are duplicated, the values in columns l_1 and l_4 of those rows are duplicated as well.
Expected output:
ID l_1 l_2 l_3 l_4
1 10 11 5 6
2 12 12 9 21
3 32 32 32 12
4 45 11 12 77
4 0 21 21 0
5 20 27 21 2
5 0 38 18 0
5 0 12 12 0
6 20 9 8 8
6 0 6 1 0
Is there a straightforward way using pandas or numpy to accomplish this?
So far I could only accomplish it with all of these steps:
x1 = df[df.duplicated(subset=['ID'], keep=False)].copy()
x1.loc[x1.groupby('ID')['l_1'].apply(lambda x: (x.shift(1) == x)), 'l_1'] = 0
x1.loc[x1.groupby('ID')['l_4'].apply(lambda x: (x.shift(1) == x)), 'l_4'] = 0
df = df.drop_duplicates(subset=['ID'], keep=False)
df = pd.concat([df, x1])
Isn't this just:
df.loc[df.duplicated('ID'), ['l_1','l_4']] = 0
Output:
ID l_1 l_2 l_3 l_4
0 1 10 11 5 6
1 2 12 12 9 21
2 3 32 32 32 12
3 4 45 11 12 77
4 4 0 21 21 0
5 5 20 27 21 2
6 5 0 38 18 0
7 5 0 12 12 0
8 6 20 9 8 8
9 6 0 6 1 0
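This works because duplicated() uses keep='first' by default: it returns False for the first occurrence of each ID and True for every later one, so only the repeats are zeroed. A quick sketch of the mask, using the df from the question:
import pandas as pd

df = pd.DataFrame({"ID": [1,2,3,4,4,5,5,5,6,6],
                   "l_1": [10,12,32,45,45,20,20,20,20,20],
                   "l_2": [11,12,32,11,21,27,38,12,9,6],
                   "l_3": [5,9,32,12,21,21,18,12,8,1],
                   "l_4": [6,21,12,77,77,2,2,2,8,8]})

# False on the first occurrence of each ID, True on every repeat
print(df.duplicated('ID').tolist())
# [False, False, False, False, True, False, True, True, False, True]

df.loc[df.duplicated('ID'), ['l_1', 'l_4']] = 0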

Removing rows and keeping consecutive rows pandas

I would like to omit the first row and keep x consecutive rows;
in the example below I would like to keep 7. How do I achieve this?
df = pd.Series(range(1,101)).to_frame()
df.columns = ['numbers']
df['numbers'][1::7]
1 2
8 9
15 16
22 23
29 30
36 37
43 44
50 51
57 58
64 65
71 72
78 79
85 86
92 93
99 100
I would like to keep the values below but continue with the next row sequence:
remove 1 and keep 2 to 7, then remove 8 and keep 9 to 14, and so on.
df = pd.Series(range(1,101)).to_frame()
df.columns = ['numbers']
df['numbers'][1:7]
1 2
2 3
3 4
4 5
5 6
6 7
Use loc:
df.loc[df.index % 7 != 0]
giving
numbers
1 2
2 3
3 4
4 5
5 6
6 7
8 9
9 10
10 11
11 12
12 13
13 14
15 16
16 17
... ...
Or use drop:
df.drop(df.index[::7])
numbers
1 2
2 3
3 4
4 5
5 6
6 7
8 9
9 10
10 11
11 12
12 13
13 14
15 16
16 17
17 18
18 19
.. ...

awk cumulative sum in one dimension

Good afternoon,
I would like to make a cumulative sum for each column and each line in awk.
My input file is:
1 2 3 4
2 5 6 7
2 3 6 5
1 2 1 2
And I would like, per column:
1 2 3 4
3 7 9 11
5 10 15 16
6 12 16 18
6 12 16 18
And I would like, per line:
1 3 5 9 9
2 7 13 20 20
2 5 11 16 16
1 3 4 6 6
I did the sum per column as:
awk '{ for (i=1; i<=NF; ++i) sum[i] += $i}; END { for (i in sum) printf "%s ", sum[i]; printf "\n"; }' test.txt # sum
And per line:
awk '
BEGIN {FS=OFS=" "}
{
sum=0; n=0
for(i=1;i<=NF;i++)
{sum+=$i; ++n}
print $0,"sum:"sum,"count:"n,"avg:"sum/n
}' test.txt
But I would like to print all the lines and columns.
Do you have an idea?
It looks like you have all the correct information available; all you are missing are the print statements.
Is this what you are looking for?
accumulated sum of the columns:
% cat foo
1 2 3 4
2 5 6 7
2 3 6 5
1 2 1 2
% awk '{ for (i=1; i<=NF; ++i) {sum[i]+=$i; $i=sum[i] }; print $0}' foo
1 2 3 4
3 7 9 11
5 10 15 16
6 12 16 18
accumulated sum of the rows:
% cat foo
1 2 3 4
2 5 6 7
2 3 6 5
1 2 1 2
% awk '{ sum=0; for (i=1; i<=NF; ++i) {sum+=$i; $i=sum }; print $0}' foo
1 3 6 10
2 7 13 20
2 5 11 16
1 3 4 6
Both of these make use of the following:
each variable has the value 0 by default (if used numerically)
I replace the field $i with the running sum value
I reprint the full line with print $0
row sums with repeated last element
$ awk '{s=0; for(i=1;i<=NF;i++) $i=s+=$i; $i=s}1' file
1 3 6 10 10
2 7 13 20 20
2 5 11 16 16
1 3 4 6 6
After the loop, i has been incremented to NF+1, so $i=s appends the sum as an extra field, and the trailing 1 prints the line with that extra field.
column sums with repeated last row
$ awk '{for(i=1;i<=NF;i++) c[i]=$i+=c[i]}1; END{print}' file
1 2 3 4
3 7 9 11
5 10 15 16
6 12 16 18
6 12 16 18
END{print} repeats the last row
PS: your math seems to be wrong for the row sums (for input 1 2 3 4 the running sums are 1 3 6 10, not 1 3 5 9).
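For comparison, the same two cumulative sums as a short numpy sketch (not from the awk answers; axis=0 accumulates down the columns, axis=1 across the rows):
import numpy as np

a = np.array([[1, 2, 3, 4],
              [2, 5, 6, 7],
              [2, 3, 6, 5],
              [1, 2, 1, 2]])

col = np.cumsum(a, axis=0)            # running sum down each column
print(np.vstack([col, col[-1]]))      # repeat the last row, as in the awk version

row = np.cumsum(a, axis=1)            # running sum across each row
print(np.hstack([row, row[:, -1:]]))  # repeat the last element of each row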