I have a CSV file with two fields: the date ($1) and the daily temperature throughout the year ($2). I want to extract the temperatures from April to September, but with each month in its own column, like this:
April   May
17 C    20 C
15 C    22 C
15 C    21 C
...
Using the following commands I get a temp.csv file with all the temperatures in a single column:
awk ' /2020-04/ {print $2}' year-temperatures.csv >> temp.csv
awk ' /2020-05/ {print $2}' year-temperatures.csv >> temp.csv
awk ' /2020-06/ {print $2}' year-temperatures.csv >> temp.csv
What should be done to put each month in its own column?
Look at the following script (temperature.awk):
BEGIN {
    # use "#" as the array subscript separator, so keys read like "month#day"
    SUBSEP = "#"
}
{
    # $1 is the ISO date (YYYY-MM-DD); pull out month and day as numbers
    month = 0 + substr($1, 6, 2)
    day   = 0 + substr($1, 9, 2)
    a[month, day] = $2
}
END {
    # header row: the month numbers 1..12
    printf("%5s ", "")
    for (month = 1; month <= 12; month++) {
        printf("%5s ", month)
    }
    printf("\n")
    # one row per day of month, one column per month
    for (day = 1; day <= 31; day++) {
        printf("%4s: ", day)
        for (month = 1; month <= 12; month++) {
            printf("%5s ", a[month, day])
        }
        printf("\n")
    }
}
When running: gawk -F, -f temperature.awk year-temperatures.csv >> temp.csv
your temp.csv should look something like this (with my test data). If you only want April through September, change both month loops in the END block to run from 4 to 9:
1 2 3 4 5 6 7 8 9 10 11 12
1: 17.5 19.9 21.5 19.6 18.7 14.2 18.5 18.9 15.9 14.3 21.4 21.4
2: 18.6 20.7 17.6 14 12.7 13.4 17.1 12.3 21.6 17.3 18.8 12.8
3: 18.3 21.8 21.8 19.1 15.6 12.5 18 12.8 18.5 21.7 17.6 17.8
4: 14 14.7 13.9 21.6 18 20.3 16.8 15 15.7 14.4 19.5 18.7
5: 12.7 16.3 12.3 18.7 20.9 12.1 18.1 14.5 21.1 15 12.6 18.1
6: 19.7 15.2 17.7 16.5 18.6 17.4 17.9 15.4 16.4 19.9 12.7 12.2
7: 18.3 15.1 19.7 14.6 18.2 18.7 13.2 21.8 16.5 12.4 13.8 15
8: 20.2 18.2 13.5 21.3 13.4 19.4 20.2 20.6 21.5 20.3 18.7 16.2
9: 14.4 13.4 16.4 20.8 20.3 18.8 19.5 15.7 15.7 12.4 20.3 14.1
10: 19.4 20.7 19.3 18.2 19.4 14 14.9 14.7 12.2 19.1 13.2 20
11: 21.8 21.2 15.2 16.7 14 21.4 14.1 14.5 12.1 16.3 13.4 15.8
12: 18.8 21.9 16.2 16.7 20 13.3 13.8 16.2 21.6 12.2 15.1 16.8
13: 16.5 14 13.4 21.5 16 20 14.7 15.5 19.7 20 13.4 14.7
14: 14.3 12.2 16.2 15.5 18 18.1 20 17 21.9 21.3 19.9 21.2
15: 20 16.9 19.1 21.1 19.7 18.4 14.1 16.3 18.5 14.6 17.2 19.7
16: 15.1 16.1 14.8 16.9 12.8 15.8 18.2 18.5 14.7 16.9 14.1 13.1
17: 13.3 17.7 14.7 19.2 12.9 21.6 16.8 21.6 16.2 19 17.1 14.1
18: 19.5 18.3 17.3 13.3 14.2 18.9 17.4 20.4 14.6 12.4 21.3 19.5
19: 15.4 16.3 20.1 16.8 20.2 17.6 14.4 15.4 12.6 12.8 13 13
20: 16.8 14.7 16.6 12.2 16.2 19.3 18 13.8 17 14.9 19 14.5
21: 15.4 12.4 20.6 18.6 18.7 21.8 14.7 20.6 15.1 13.9 14.1 21.8
22: 14.9 16.1 21.4 14.4 12.8 19.2 17.5 19.5 12.8 12.7 21.5 13.1
23: 16.3 21.1 12.9 14.3 16.1 18.6 21.3 13.9 16.6 20.2 13.2 18.5
24: 14.9 15.3 18.7 16.3 19.8 13.5 12.1 19 12.7 20.5 19.5 20.9
25: 13.3 21 12.5 16.5 18.9 19.4 14.8 21.3 21.5 20.2 15.9 17
26: 20 17.4 14.4 21.7 12.8 14.6 15.5 17.4 17.5 17.5 18.9 20.2
27: 18 12 12.5 17.1 15.7 12.9 21 21.2 20.8 15 14.8 18.3
28: 17.9 15.9 17.6 18.2 17.7 18.5 16.7 21.8 19.6 20.2 15.6 18.7
29: 13.8 18.2 17.9 19.7 21.7 18.6 13.4 13.7 14.1 21.2 16.7
30: 13.1 16.1 12.9 13.3 21.1 20.9 19.5 17.5 18 17.4 15.3
31: 12.3 14 15.2 16.7 15.3 15.5 14.4
The first couple of lines of my test data look like this:
2022-01-01, 17.5
2022-01-02, 18.6
2022-01-03, 18.3
2022-01-04, 14
2022-01-05, 12.7
2022-01-06, 19.7
2022-01-07, 18.3
2022-01-08, 20.2
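If you would rather do the same reshaping in pandas (the tool used in the later answers below), a rough equivalent sketch, assuming year-temperatures.csv has exactly the two columns shown above (an ISO date and a temperature), could be:
import pandas as pd

# read the two-column CSV; skipinitialspace eats the blank after the comma
df = pd.read_csv('year-temperatures.csv', header=None, names=['date', 'temp'],
                 parse_dates=['date'], skipinitialspace=True)
# pivot: one row per day of month, one column per month
wide = df.pivot_table(index=df['date'].dt.day, columns=df['date'].dt.month,
                      values='temp')
print(wide[[4, 5, 6, 7, 8, 9]])   # April through September only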
Related
I have a DataFrame called "YMp" and I have made a boxplot of the data for the years 1991 - 2019, but I need to over-plot the current year's (2020) values on the boxplot as colored points, with a legend showing the year 2020.
The data looks like this -
month 01 02 03 04 05 06 07 08 09 10 11 12
year
1991 -4.9 12.2 -11.1 -18.0 -27.5 1.7 7.4 22.7 38.3 4.2 -0.9 5.3
1992 -10.9 -17.1 -7.7 14.8 14.8 -9.6 17.0 24.7 32.3 0.3 -21.6 15.3
1993 -1.8 -2.3 -3.8 0.4 -4.8 -7.7 11.7 26.3 17.1 2.6 4.4 2.4
1994 2.6 2.5 -6.2 -3.2 2.2 -3.0 13.8 3.9 30.4 -25.7 -1.8 -2.2
1995 -8.6 -3.3 -18.4 -14.0 -19.3 13.2 9.8 -23.2 16.0 -15.2 0.6 -8.5
1996 -5.5 -10.4 -0.3 7.2 13.0 3.6 5.2 1.4 -10.3 -2.9 15.4 -0.6
1997 -11.1 8.9 -1.1 -12.7 3.0 -4.0 27.1 32.6 -4.5 -15.5 -5.5 -20.9
1998 -22.0 -16.6 3.2 0.7 15.4 16.0 18.5 2.7 -32.3 16.3 -5.4 12.9
1999 -1.0 0.3 -8.5 9.9 7.4 -2.1 10.9 -5.5 18.5 17.4 17.5 11.1
2000 5.4 12.9 -24.8 15.7 -9.3 20.7 18.2 23.2 16.6 26.8 -17.7 17.3
2001 -3.9 14.5 -4.7 18.6 5.6 22.4 -3.3 18.2 5.3 31.2 6.0 -4.0
2002 -9.0 19.5 12.5 24.5 27.6 -9.3 3.7 13.7 -32.7 -19.5 0.7 -6.1
2003 23.6 -11.7 -16.5 -2.1 6.5 -13.7 0.4 8.0 -13.7 -16.1 7.3 13.1
2004 6.6 4.7 36.8 12.8 29.5 6.4 -12.2 -0.6 -7.7 -15.2 -1.1 12.7
2005 6.3 1.1 -14.6 9.4 -7.5 6.1 -9.2 -1.3 36.1 -4.9 10.8 -11.7
2006 7.3 8.3 1.7 11.8 -14.7 33.3 9.1 -0.0 3.0 1.4 -2.8 8.8
2007 5.4 0.2 7.2 -3.9 6.6 -8.3 -28.2 -7.6 3.3 -7.4 25.0 -7.3
2008 5.0 -5.6 7.6 -0.4 -1.2 13.9 -11.3 -29.7 16.7 43.1 2.4 3.5
2009 -2.2 17.1 9.8 8.9 -9.2 -14.4 6.1 21.7 -0.2 -26.7 -9.1 -18.2
2010 -2.6 -12.1 0.8 -16.5 4.1 3.9 -21.5 -3.3 -18.9 22.8 -6.5 -5.3
2011 -12.4 -3.8 1.2 -14.9 -2.0 6.8 -12.6 -16.9 8.3 10.7 -0.7 4.6
2012 0.5 -3.0 -1.0 -6.5 7.5 -17.9 -4.3 -26.3 -2.6 3.0 12.3 -15.3
2013 -1.7 -15.1 18.8 -8.3 7.5 -4.5 -19.3 0.9 -33.9 -10.6 -0.4 4.4
2014 7.2 -20.0 -8.4 2.0 10.1 -20.2 7.8 -14.9 -11.4 -6.9 -0.3 6.4
2015 18.4 6.2 10.5 -16.5 -11.9 7.0 -7.3 -6.7 -20.8 -13.9 -3.3 -14.8
2016 -11.3 28.5 -9.2 -4.2 -9.7 1.0 -5.1 -18.9 -3.3 19.1 -1.1 10.1
2017 -8.6 -8.1 21.2 4.5 -21.2 -28.5 -6.8 -30.8 -19.7 13.3 7.2 9.9
2018 26.9 3.1 -7.1 -3.4 -8.7 -15.5 12.8 3.9 -16.4 -7.9 -25.7 -9.2
2019 2.1 -17.3 10.2 1.0 -13.5 -3.4 -14.0 -20.7 -1.5 -28.4 -5.4 -13.9
The current year 2020 data looks like this -
2020 0.3 6.2 2.0 -17.9 -0.4 6.0 -24.5 2.5 -12.1 4.6 NaN NaN
My boxplot currently shows none of the 2020 data plotted or highlighted. Thank you for helping with ideas about doing this.
Try grabbing the axes instance and plotting on top of it:
import numpy as np

ax = df.boxplot()
# the 12 month boxes sit at x = 1..12; pick the row (year) you want to overlay
ax.scatter(np.arange(df.shape[1]) + 1, df.loc[2000], color='r')
Output: (the boxplot with the selected year's values shown as red points on top)
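To overlay the 2020 values specifically, with a legend entry for the year, a minimal sketch (assuming YMp holds the 1991-2019 rows and y2020 is the 2020 Series shown in the question) could look like this:
import numpy as np
import matplotlib.pyplot as plt

ax = YMp.boxplot()
# month boxes sit at x = 1..12; NaN months (Nov, Dec 2020) are simply not drawn
ax.scatter(np.arange(YMp.shape[1]) + 1, y2020.values,
           color='r', zorder=3, label='2020')
ax.legend()
plt.show()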
I am using SQL Server 2014 and have a problem. This is my current query:
select *
from (
select [DataTime] datatime,[Temperature] temperature,[Humidity] humidity,
b.[serialnumber] serialnumber, Row_Number() OVER(ORDER BY a.datatime) rownum
from [dbo].[datalog] a,[report_devicelist] b where a.deviceno = b.deviceno and b.report_no='201906140013yEcD'
and a.datatime between '2019-04-09 15:05:00' and '2019-04-09 16:20:52'
) as t where rownum between 1 and 50
It returns rows like this:
time temperature humidity serialnumber rownum
2019-04-09 15:05:01 268 0 ch4 1
2019-04-09 15:05:01 272 0 ch5 2
2019-04-09 15:05:01 266 0 ch6 3
2019-04-09 15:05:01 264 0 ch7 4
2019-04-09 15:05:01 263 0 ch8 5
2019-04-09 15:06:01 253 0 ch3 15
2019-04-09 15:06:01 245 0 ch2 16
2019-04-09 15:06:01 257 0 ch1 17
2019-04-09 15:06:01 272 0 ch14 18
2019-04-09 15:06:01 250 0 ch13 19
2019-04-09 15:06:01 254 0 ch12 20
2019-04-09 15:06:01 263 0 ch11 21
2019-04-09 15:06:01 256 0 ch10 22
What I want is one column per serial number at each time point, like this:
time ch1 ch2 ch3 ch4 ch5 ch6 ch7 ch8 ch9
2019/03/05 11:41:01 16.9 15.3 17.2 17.1 15.2 16.9 17.4 16.1 17.1
2019/03/05 11:42:01 16.5 15.4 16.8 16.6 14.8 16.7 17.0 15.9 16.3
2019/03/05 11:43:01 16.3 15.5 16.6 16.2 14.5 16.5 16.6 15.9 15.9
2019/03/05 11:44:01 16.4 15.3 16.7 15.9 14.4 16.9 16.3 16.1 15.6
2019/03/05 11:45:01 16.8 15.2 16.7 15.7 14.3 16.7 16.0 16.6 15.4
2019/03/05 11:46:01 16.6 15.1 16.9 15.4 14.2 16.5 15.8 16.7 15.3
2019/03/05 11:47:01 16.6 15.2 17.4 15.4 14.3 17.7 15.9 16.6 15.3
2019/03/05 11:48:01 16.2 15.0 17.1 15.4 14.2 17.5 15.8 16.4 15.3
2019/03/05 11:49:01 15.8 14.5 16.8 15.2 14.1 17.1 15.5 16.1 15.1
2019/03/05 11:50:01 15.2 13.7 16.4 14.8 13.9 16.4 14.9 15.5 14.8
2019/03/05 11:51:01 14.6 12.7 15.8 14.3 13.5 15.6 14.2 14.8 14.2
That is, I need to group the results by time so that every serial number's temperature is listed on one row per time point. I tried GROUP BY datatime, but then the other columns are not in an aggregate function.
It seems like you need to use PIVOT, as in:
select top 50 datatime,[ch1], [ch2], [ch3], [ch4], [ch5], [ch6], [ch7], [ch8], [ch9]
from
(select
[DataTime] datatime, [Temperature] temperature,
b.[serialnumber] serialnumber
from [dbo].[datalog] a,[report_devicelist] b where a.deviceno = b.deviceno and b.report_no='201906140013yEcD'
and a.datatime between '2019-04-09 15:05:00' and '2019-04-09 16:20:52'
) AS SourceTable
PIVOT
(
AVG(temperature)
FOR serialnumber IN ([ch1], [ch2], [ch3], [ch4], [ch5], [ch6], [ch7], [ch8], [ch9])
) AS PivotTable
Order by datatime;
TOP 50 limits the result to the first 50 rows, sorted by [datatime].
I'm trying to convert an entire column of date values into a column of numbers running from 1 to the last day of the month in a Pandas DataFrame.
The code has to be able to deal with columns of 28, 29, 30 or 31 values, depending on which month is concerned.
So my df:
DAY TX TN
0 20190201 4.9 -0.6
1 20190202 2.7 0.0
2 20190203 4.6 -0.3
3 20190204 2.9 -0.5
4 20190205 6.2 1.3
5 20190206 7.5 2.4
6 20190207 8.6 4.6
7 20190208 8.6 5.0
8 20190209 9.2 6.7
9 20190210 9.1 3.8
10 20190211 6.9 0.7
11 20190212 7.0 -0.5
12 20190213 7.8 -0.5
13 20190214 13.4 0.0
14 20190215 16.4 2.0
15 20190216 14.8 2.0
16 20190217 15.7 1.2
17 20190218 15.4 1.2
18 20190219 9.8 4.3
19 20190220 11.1 2.8
20 20190221 13.1 5.8
21 20190222 10.7 4.1
22 20190223 12.9 1.5
23 20190224 14.5 1.2
24 20190225 16.1 2.2
25 20190226 17.2 0.3
26 20190227 19.3 1.1
27 20190228 11.3 5.1
should become
DAY TX TN
0 1 4.9 -0.6
1 2 2.7 0.0
2 3 4.6 -0.3
3 4 2.9 -0.5
4 5 6.2 1.3
5 6 7.5 2.4
6 7 8.6 4.6
7 8 8.6 5.0
8 9 9.2 6.7
9 10 9.1 3.8
10 11 6.9 0.7
11 12 7.0 -0.5
12 13 7.8 -0.5
13 14 13.4 0.0
14 15 16.4 2.0
15 16 14.8 2.0
16 17 15.7 1.2
17 18 15.4 1.2
18 19 9.8 4.3
19 20 11.1 2.8
20 21 13.1 5.8
21 22 10.7 4.1
22 23 12.9 1.5
23 24 14.5 1.2
24 25 16.1 2.2
25 26 17.2 0.3
26 27 19.3 1.1
27 28 11.3 5.1
I have to process each value of this column, and I also need to check that no day is missing and that the numbering adapts to whichever month's DataFrame I provide.
I searched the Pandas documentation for a function that could help but didn't find one.
Any help would be appreciated.
Use to_datetime with Series.dt.day:
df['DAY'] = pd.to_datetime(df['DAY'], format='%Y%m%d').dt.day
Another solution is to cast the values to strings, take the last two characters by indexing, and cast them back to integers:
df['DAY'] = df['DAY'].astype(str).str[-2:].astype(int)
print (df)
DAY TX TN
0 1 4.9 -0.6
1 2 2.7 0.0
2 3 4.6 -0.3
3 4 2.9 -0.5
4 5 6.2 1.3
5 6 7.5 2.4
6 7 8.6 4.6
7 8 8.6 5.0
8 9 9.2 6.7
9 10 9.1 3.8
10 11 6.9 0.7
11 12 7.0 -0.5
12 13 7.8 -0.5
13 14 13.4 0.0
14 15 16.4 2.0
15 16 14.8 2.0
16 17 15.7 1.2
17 18 15.4 1.2
18 19 9.8 4.3
19 20 11.1 2.8
20 21 13.1 5.8
21 22 10.7 4.1
22 23 12.9 1.5
23 24 14.5 1.2
24 25 16.1 2.2
25 26 17.2 0.3
26 27 19.3 1.1
27 28 11.3 5.1
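The question also asks how to check that no day is missing. One way to do that (a sketch, assuming df['DAY'] still holds the original YYYYMMDD values) is to compare the parsed dates against the full calendar of that month before replacing the column:
import pandas as pd

dates = pd.to_datetime(df['DAY'], format='%Y%m%d')
# build the full calendar of the month the data belongs to
full_month = pd.date_range(dates.min().replace(day=1),
                           periods=dates.dt.days_in_month.iloc[0], freq='D')
missing = full_month.difference(dates)
if not missing.empty:
    print('Missing days:', list(missing.strftime('%Y-%m-%d')))
df['DAY'] = dates.dt.day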
You can just slice the column to get the last 2 digits and cast to int:
In[85]:
df['DAY'] = df['DAY'].str[-2:].astype(int)
df
Out[85]:
DAY TX TN
0 1 4.9 -0.6
1 2 2.7 0.0
2 3 4.6 -0.3
3 4 2.9 -0.5
4 5 6.2 1.3
5 6 7.5 2.4
6 7 8.6 4.6
7 8 8.6 5.0
8 9 9.2 6.7
9 10 9.1 3.8
10 11 6.9 0.7
11 12 7.0 -0.5
12 13 7.8 -0.5
13 14 13.4 0.0
14 15 16.4 2.0
15 16 14.8 2.0
16 17 15.7 1.2
17 18 15.4 1.2
18 19 9.8 4.3
19 20 11.1 2.8
20 21 13.1 5.8
21 22 10.7 4.1
22 23 12.9 1.5
23 24 14.5 1.2
24 25 16.1 2.2
25 26 17.2 0.3
26 27 19.3 1.1
27 28 11.3 5.1
If the dtype is already int, then you just need to cast it to str first:
df['DAY'] = df['DAY'].astype(str).str[-2:].astype(int)
I have the DataFrame below and I want to convert it to a numpy array.
When I tried, the time order was broken in the converted array;
maybe that's because it is time-series data (19:00:00 through 00:00:00 to 07:00:00).
How can I keep the time order when converting the DataFrame to a numpy array?
aaa \
Date 2015-12-06 2015-12-13 2015-12-20 2015-12-23 2015-12-26 2016-01-03
Time
19:00:00 4.72 8.50 3.87 7.95 1.76 9.82
19:15:00 4.54 8.00 3.72 8.14 1.74 9.77
19:30:00 4.44 8.17 3.72 7.99 1.75 9.77
19:45:00 4.37 7.92 3.28 7.94 1.89 9.61
20:00:00 4.03 7.54 2.48 7.99 1.98 9.46
20:15:00 3.74 7.86 3.30 7.68 1.63 9.30
20:30:00 3.48 8.41 3.52 7.88 1.52 9.22
20:45:00 3.31 8.52 3.81 7.83 1.54 9.08
21:00:00 3.17 8.23 3.97 7.96 1.63 9.14
21:15:00 2.99 8.23 3.37 7.61 1.87 9.14
21:30:00 2.96 8.26 3.23 7.63 2.03 9.13
21:45:00 2.69 7.89 3.10 7.34 2.12 9.04
22:00:00 2.62 7.83 2.94 7.21 2.11 9.04
22:15:00 2.55 7.78 2.83 7.26 2.39 9.01
22:30:00 2.49 7.73 2.89 7.15 2.30 9.08
22:45:00 2.48 7.80 2.79 7.02 2.22 8.92
23:00:00 2.38 7.71 2.92 7.17 2.43 8.80
23:15:00 2.23 7.74 3.01 7.24 2.33 8.56
23:30:00 2.29 7.51 3.10 7.14 2.38 8.32
23:45:00 2.29 7.31 3.00 6.89 2.10 8.02
00:00:00 2.17 6.84 2.84 6.89 1.82 7.86
00:15:00 2.13 6.84 2.65 7.06 1.36 7.95
00:30:00 2.21 6.78 2.63 6.98 0.92 7.97
00:45:00 2.19 6.41 2.18 7.08 1.05 7.80
01:00:00 2.13 6.24 1.56 7.20 0.81 7.73
01:15:00 2.14 5.90 1.39 7.31 1.01 7.89
01:30:00 2.13 5.74 1.81 7.58 0.79 7.91
01:45:00 2.11 5.82 1.60 7.47 1.19 8.02
02:00:00 1.72 6.01 0.90 7.14 1.27 8.09
02:15:00 1.94 6.04 1.12 7.33 0.95 8.13
02:30:00 2.05 6.00 1.44 7.06 1.15 8.15
02:45:00 1.96 6.03 1.45 6.86 1.05 7.95
03:00:00 1.63 6.28 1.62 6.85 1.22 7.43
03:15:00 1.79 6.14 1.41 6.94 1.05 6.97
03:30:00 1.37 6.03 1.29 6.98 1.27 6.97
03:45:00 1.44 5.84 1.01 7.29 1.31 6.90
04:00:00 1.37 5.62 0.92 7.13 1.35 6.77
04:15:00 1.62 5.75 0.95 7.18 1.21 7.09
04:30:00 1.64 5.71 1.06 7.18 1.32 7.27
04:45:00 1.40 5.46 0.79 7.17 1.55 7.35
05:00:00 1.51 5.48 0.64 6.83 1.42 7.27
05:15:00 1.46 5.80 0.52 6.58 1.60 7.21
05:30:00 1.61 5.59 0.35 6.98 1.54 7.13
05:45:00 1.49 5.28 0.46 6.58 1.58 7.04
06:00:00 1.55 5.00 0.17 6.35 1.88 7.10
06:15:00 1.94 4.94 -0.18 6.12 1.94 7.11
06:30:00 1.45 5.01 -0.31 6.02 1.90 7.14
06:45:00 1.36 4.90 -0.17 5.83 2.06 7.17
07:00:00 1.25 4.75 0.20 5.70 2.35 7.18
You need to transpose the DataFrame with .T and then convert it to an array:
arr = df.T.values
Or first convert to array and then transpose:
arr = df.values.T
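A small usage note (assuming df is the wide Date/Time frame shown above): .values keeps whatever row order the DataFrame currently has, so as long as the 19:00 to 07:00 index is not re-sorted before converting, the time order survives in the array:
arr = df.T.values    # one row per date column, times kept in their original order
print(arr[0])        # the 19:00:00 ... 07:00:00 values for the first date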
I want to use np.savetxt(file, array, fmt='%8.1f') to save an array as a text file like this:
1958 6.4 1.8 7.7 70.1 41.4 38.5 65.4 25.7
1959 27.2 42.5 63.3 86.2 101.5 71.4 114.2 137.9
1960 22.9 18.3 28.7 106.5 159.1 50.4 203 121.6
1961 4.4 26.9 47.1 67.9 53.6 64.8 95 42
1962 20.9 31.2 60.6 38.8 66.2 37.9 67.9 62.3
1963 11.9 14.5 59 56 83.1 110.9 77.1 93.5
Each element should take up 8 characters, one after another (no separator between them).
The first column (the year) should be formatted with %8d and the others with %8.1f, flush right.
How can I do this in numpy, or using pandas?
n = len(df.columns)
# one 8-wide integer-style field for the year, then 8-wide one-decimal floats
fmt = ('{:8.0f}' + '{:8.1f}' * (n - 1)).format
print(df.apply(lambda x: fmt(*x), 1).to_csv(index=None, header=None))
1958 6.4 1.8 7.7 70.1 41.4 38.5 65.4 25.7
1959 27.2 42.5 63.3 86.2 101.5 71.4 114.2 137.9
1960 22.9 18.3 28.7 106.5 159.1 50.4 203.0 121.6
1961 4.4 26.9 47.1 67.9 53.6 64.8 95.0 42.0
1962 20.9 31.2 60.6 38.8 66.2 37.9 67.9 62.3
1963 11.9 14.5 59.0 56.0 83.1 110.9 77.1 93.5
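If you would rather stay in numpy, np.savetxt also accepts one format per column; a sketch assuming arr is the array shown at the top of the question, with the year in its first column, writing to a hypothetical temperatures.txt:
import numpy as np

n = arr.shape[1]
# '%8d' for the year column, '%8.1f' for the rest; the empty delimiter glues the
# 8-character fields together, flush right
np.savetxt('temperatures.txt', arr, fmt=['%8d'] + ['%8.1f'] * (n - 1), delimiter='')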