How can I give a numeric order to each line based on a unique ID - awk

I'm working on a big data set and I need to assign and print a numeric order for each unique ID ($1), and I want to delete the lines whose per-ID numeric order is above 335.
The data looks like this:
101 24
101 13
101 15
102 25
102 21
102 23
103 20
103 12
103 18
The expected output looks like this:
101 24 1
101 13 2
101 15 3
102 25 1
102 21 2
102 23 3
103 20 1
103 12 2
103 18 3

Try the one below.
Input
$ cat f
101 24
101 13
101 15
102 25
102 21
102 23
103 20
103 12
103 18
Output
$ awk '{print $0,++a[$1]}' f
101 24 1
101 13 2
101 15 3
102 25 1
102 21 2
102 23 3
103 20 1
103 12 2
103 18 3
If the data is sorted on column 1, then use the one below, which is faster:
$ awk '$1!=p{n=0}{print $0,++n; p=$1}' f
101 24 1
101 13 2
101 15 3
102 25 1
102 21 2
102 23 3
103 20 1
103 12 2
103 18 3
To remove lines with a per-ID order above 335 (the first command assumes input sorted on column 1):
$ awk '$1!=p{n=0; p=$1}++n<=335{print $0,n}' f
$ awk '++a[$1]<=335{print $0,a[$1]}' f
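For readers working in pandas rather than awk, here is a minimal sketch of the same idea (per-ID running order, keep at most 335 rows per ID). It assumes a whitespace-delimited file f with the two columns shown above; the column names id and val are just placeholders.

import pandas as pd

# assumed: 'f' holds two whitespace-separated columns as in the sample
df = pd.read_csv('f', sep=r'\s+', header=None, names=['id', 'val'])

# running order within each ID, starting at 1
df['order'] = df.groupby('id').cumcount() + 1

# keep only the first 335 rows of each ID
df = df[df['order'] <= 335]
print(df.to_string(index=False, header=False))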

Related

Fill day gaps with two tables in SQL

I have three different IDs. The IDs are dynamic.
For each ID, I need to complete a calendar with the last existing value.
Example:
ID  VALUE  date
1   30     1/1/2020
1   29     3/1/2020
2   65     1/1/2020
3   30     2/1/2020
1   11     6/1/2020
2   40     4/1/2020
3   23     5/1/2020
Expected output:
ID  VALUE  date
1   30     1/1/2020
1   30     2/1/2020
1   29     3/1/2020
1   29     4/1/2020
1   29     5/1/2020
1   11     6/1/2020
2   65     1/1/2020
2   65     2/1/2020
2   65     3/1/2020
2   40     4/1/2020
2   40     5/1/2020
2   40     6/1/2020
3   30     2/1/2020
3   30     3/1/2020
3   30     4/1/2020
3   23     5/1/2020
3   23     6/1/2020
I need to complete the fields until today for each ID.
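The question asks for SQL, but as an illustration of the logic only, here is a hedged pandas sketch. It assumes the dates are day-first daily dates (1/1/2020 = 1 Jan 2020) and that a dataframe df holds the ID, VALUE and date columns shown above; for each ID it builds a daily calendar from that ID's first date and carries the last existing VALUE forward.

import pandas as pd

# assumed sample data (day-first dates)
df = pd.DataFrame({
    'ID':    [1, 1, 2, 3, 1, 2, 3],
    'VALUE': [30, 29, 65, 30, 11, 40, 23],
    'date':  ['1/1/2020', '3/1/2020', '1/1/2020', '2/1/2020',
              '6/1/2020', '4/1/2020', '5/1/2020'],
})
df['date'] = pd.to_datetime(df['date'], dayfirst=True)

# fill up to the last date in the data; use pd.Timestamp.today().normalize()
# instead to complete the calendar until today
end = df['date'].max()

def fill_one_id(s):
    # s: VALUE series indexed by date for a single ID
    days = pd.date_range(s.index.min(), end, freq='D')
    return s.reindex(days).ffill()

filled = (df.sort_values('date')
            .set_index('date')
            .groupby('ID')['VALUE']
            .apply(fill_one_id)
            .reset_index())
filled.columns = ['ID', 'date', 'VALUE']
print(filled)

In SQL, the same shape of solution is usually a calendar table cross-joined with the distinct IDs, left-joined to the data, with the last non-null VALUE carried forward; the exact syntax depends on the dialect.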

Pandas drop_duplicates with multiple conditions

I have some measurement data that needs to be filtered. I read it into a dataframe, like this:
df
RequestTime RequestID ResponseTime ResponseID
0 150 14 103 101
1 150 15 110 102
2 25 16 121 103
3 25 16 97 104
4 22 16 44 105
5 19 17 44 106
6 26 18 29 106
7 30 18 29 106
I need to apply two different conditions at the same time, that is, to filter on 'RequestTime'/'RequestID' and on 'ResponseTime'/'ResponseID' with drop_duplicates(subset=) simultaneously. I have used the following commands to get the filtered results for each of the two conditions separately:
>>>df[['RequestTime','RequestID','ResponseTime','ResponseID']].drop_duplicates(subset = ['ResponseTime','ResponseID'])
RequestTime RequestID ResponseTime ResponseID
0 150 14 103 101
1 150 15 110 102
2 25 16 121 103
4 22 16 44 105
5 19 17 44 106
6 26 18 29 106
7 30 18 29 106
>>>df[['RequestTime','RequestID','ResponseTime','ResponseID']].drop_duplicates(subset = ['RequestTime','RequestID'])
RequestTime RequestID ResponseTime ResponseID
0 150 14 103 101
1 150 15 110 102
2 25 16 121 103
3 25 16 97 104
4 22 16 44 105
5 19 17 44 106
6 26 18 29 106
but how do I combine the two conditions to drop the duplicate rows 3 and 7?
IIUC,
m = ~(df.duplicated(subset=['RequestTime','RequestID']) | df.duplicated(subset=['ResponseTime', 'ResponseID']))
df[m]
Output:
RequestTime RequestID ResponseTime ResponseID
0 150 14 103 101
1 150 15 110 102
2 25 16 121 103
4 22 16 44 105
5 19 17 44 106
6 26 18 29 106
Create a mask (boolean series) to boolean index your dataframe.
Or chain methods:
df.drop_duplicates(subset=['RequestTime', 'RequestID']).drop_duplicates(subset=['ResponseTime', 'ResponseID'])

Re-arrange and plot data with pandas

I have a data frame like the following:
days movements count
0 0 0 2777
1 0 1 51
2 0 2 2
3 1 0 6279
4 1 1 200
5 1 2 7
6 1 3 3
7 2 0 5609
8 2 1 110
9 2 2 32
10 2 3 4
11 3 0 4109
12 3 1 118
13 3 2 101
14 3 3 8
15 3 4 3
16 3 6 1
17 4 0 3034
18 4 1 129
19 4 2 109
20 4 3 6
21 4 4 2
22 4 5 2
23 5 0 2288
24 5 1 131
25 5 2 131
26 5 3 9
27 5 4 2
28 5 5 1
29 6 0 1918
30 6 1 139
31 6 2 109
32 6 3 13
33 6 4 1
34 6 5 1
35 7 0 1442
36 7 1 109
37 7 2 153
38 7 3 13
39 7 4 10
40 7 5 1
41 8 0 1085
42 8 1 76
43 8 2 111
44 8 3 13
45 8 4 7
46 8 7 1
47 9 0 845
48 9 1 81
49 9 2 86
50 9 3 8
51 9 4 8
52 10 0 646
53 10 1 70
54 10 2 83
55 10 3 1
56 10 4 2
57 10 5 1
58 10 6 1
This shows that, for example, on day 0 I have 2777 entries with 0 movements, 51 entries with 1 movement, and 2 entries with 2 movements. I want to plot it as a bar graph for every day and show the entry counts for all movements. To do that, I thought I would transform the data into something like the table below and then plot a bar graph.
days 0 1 2 3 4 5 6 7
0 2777 51 2
1 6279 200 7 3
2 5609 110 32 4
3 4109 118 101 8 3
4 3034 129 109 6 2 2
5 2288 131 131 9 2 1
6 1918 139 109 13 1 1
7 1442 109 153 13 10 1
8 1085 76 111 13 7 1
9 845 81 86 8 8
10 646 70 83 1 2 1 1
I can't figure out how to achieve this. I have thousands of lines of data, so doing it by hand does not make sense. Can someone show me how to rearrange the data? If there is a quick way to plot the bar graph with matplotlib straight from the actual data frame, that would be even better. Thanks for the help.
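A minimal sketch, assuming df is the data frame shown above with columns days, movements and count: pivot reshapes it into the wide table sketched in the question, and DataFrame.plot.bar draws one group of bars per day directly from it.

import matplotlib.pyplot as plt
import pandas as pd

# reshape: one row per day, one column per movement value, cells = entry counts
wide = df.pivot(index='days', columns='movements', values='count')

# grouped bar chart straight from the wide frame
wide.plot.bar(figsize=(12, 6))
plt.xlabel('days')
plt.ylabel('entries')
plt.legend(title='movements')
plt.tight_layout()
plt.show()

If any (days, movements) pair could appear more than once, pivot_table(index='days', columns='movements', values='count', aggfunc='sum') does the same job; add .fillna(0) if you prefer zeros over gaps for missing combinations.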

SQL - return the smallest value in one column that matches the value of another column

How can I return, in column 4, the smallest number from column 3 for each value in column 1?
1 100 1
2 100 1
1 101 2
1 102 4
2 200 19
3 200 19
16 200 19
18 200 19
19 200 19
20 200 19
3 301 28
6 301 28
3 302 29
3 310 30
4 400 31
4 410 32
4 420 33
5 500 34
7 500 34
5 510 35
6 510 35
5 520 36
6 610 37
7 700 38
7 701 39
8 701 39
8 800 40
8 802 41
Thank you!
Join to a subquery that calculates the minimums for each col1:
select a.col1, a.col2, a.col3, mcol3
from mytable a
join (select col1, min(col3) mcol3 from mytable group by col1) b
on b.col1 = a.col1
See SQLFiddle, showing this output from your sample data:
1 100 1 1
2 100 1 1
1 101 2 1
1 102 4 1
2 200 19 1
3 200 19 19
16 200 19 19
18 200 19 19
19 200 19 19
20 200 19 19
3 301 28 19
6 301 28 28
3 302 29 19
3 310 30 19
4 400 31 31
4 410 32 31
4 420 33 31
5 500 34 34
7 500 34 34
5 510 35 34
6 510 35 28
5 520 36 34
6 610 37 28
7 700 38 34
7 701 39 34
8 701 39 39
8 800 40 39
8 802 41 39

SQL - calculating daily stock levels for a month as an aggregate of availability

I have a table containing the following records of stock from different depots in a region. It contains:
itemName
startDate
endDate
quantity
The fields are
key(pk)
itemName- numeric code
startDate- date
endDate- date
amt- number
Sample data with 3 item types
1 101 Jan 1, 2013 Jan 14, 2013 15
2 101 Jan 12, 2013 Jan 15, 2013 3
3 102 Jan 4, 2013 Jan 26, 2013 7
4 102 Jan 6, 2013 Jan 12, 2013 19
5 103 Jan 15, 2013 Jan 16, 2013 3
6 103 Jan 12, 2013 Jan 21, 2013 19
How do I write a query that will get the number of items of each type on every day in this period? Essentially I need a query that will add up the applicable items between startDate and endDate. Thanks.
I would want a final query result that adds up the overlaps for each item, like this:
Jan 1 101 15
Jan 1 102 0
Jan 12 101 18
Jan 15 101 3
Jan 16 101 3
While I know that for a given date I can do
SELECT item, sum(amt)
FROM [table]
WHERE (date>=startdate) AND (date<=enddate)
GROUP BY item
How do I make it iterate over the whole month (Jan 1st to 31st) to produce such a report?
Here's what you need to do:
Create a table named [DayNumbers] and fill it with the numbers from 1 through 31:
DayNumber
---------
1
2
3
...
30
31
Now create a saved query in Access named [MonthDates] to create a row for each day in a specified month:
PARAMETERS SelectedYear Long, SelectedMonth Long;
SELECT DateSerial([SelectedYear], [SelectedMonth], DayNumber) AS StatusDate
FROM DayNumbers
WHERE Month(DateSerial([SelectedYear], [SelectedMonth], DayNumber)) = [SelectedMonth];
Note that the WHERE clause restricts the number of days to the actual number of days in the month (e.g., 30 for April).
Create another saved query in Access named [StockStatusRows] to create a row for each day and each item:
SELECT StatusDate, itemName
FROM
MonthDates,
(
SELECT DISTINCT itemName FROM StockData
) AS Items;
For test data in [StockData] that looks like
key itemName startDate endDate amt
--- -------- ---------- ---------- ---
1 101 2013-01-01 2013-01-14 15
2 101 2013-01-12 2013-01-15 3
3 102 2013-01-04 2013-01-26 7
4 102 2013-01-06 2013-01-12 19
5 103 2013-01-15 2013-01-16 3
6 103 2013-01-12 2013-01-21 19
7 101 2013-01-30 2013-02-03 6
8 102 2013-02-05 2013-02-23 9
9 103 2013-02-07 2013-03-02 11
the [StockStatusRows] query will return
StatusDate itemName
---------- --------
2013-01-01 101
2013-01-02 101
2013-01-03 101
...
2013-01-30 101
2013-01-31 101
2013-01-01 102
2013-01-02 102
2013-01-03 102
...
2013-01-30 102
2013-01-31 102
2013-01-01 103
2013-01-02 103
2013-01-03 103
...
2013-01-30 103
2013-01-31 103
Now we can pull together the actual stock values like so:
SELECT ssr.StatusDate, ssr.itemName, Nz(sums.total, 0) AS TotalOnHand
FROM
StockStatusRows AS ssr
LEFT JOIN
(
SELECT StatusDate, itemName, Sum(amt) AS total
FROM
(
SELECT md.StatusDate, sd.itemName, sd.amt
FROM
StockData sd
INNER JOIN
MonthDates md
ON md.StatusDate>=sd.startDate
And md.StatusDate<=sd.endDate
)
GROUP BY StatusDate, itemName
) AS sums
ON (sums.itemName=ssr.itemName)
AND (sums.StatusDate=ssr.StatusDate)
ORDER BY ssr.StatusDate, ssr.itemName;
returning
StatusDate itemName TotalOnHand
---------- -------- -----------
2013-01-01 101 15
2013-01-01 102 0
2013-01-01 103 0
2013-01-02 101 15
2013-01-02 102 0
2013-01-02 103 0
2013-01-03 101 15
2013-01-03 102 0
2013-01-03 103 0
2013-01-04 101 15
2013-01-04 102 7
2013-01-04 103 0
2013-01-05 101 15
2013-01-05 102 7
2013-01-05 103 0
2013-01-06 101 15
2013-01-06 102 26
2013-01-06 103 0
...
2013-01-12 101 18
2013-01-12 102 26
2013-01-12 103 19
2013-01-13 101 18
2013-01-13 102 7
2013-01-13 103 19
2013-01-14 101 18
2013-01-14 102 7
2013-01-14 103 19
2013-01-15 101 3
2013-01-15 102 7
2013-01-15 103 22
2013-01-16 101 0
2013-01-16 102 7
2013-01-16 103 22
2013-01-17 101 0
2013-01-17 102 7
2013-01-17 103 19
...
2013-01-22 101 0
2013-01-22 102 7
2013-01-22 103 0
...
2013-01-31 101 6
2013-01-31 102 0
2013-01-31 103 0
select itemName, sum(quantity) from <Table-Name> where startDate >= (start-date) and startDate <= (end-date) and endDate >= (start-date) and endDate <= (end-date) group by itemName
This would calculate the sum of quantities of each product between (start-date) and (end-date).
If you want just total of all items irrespective of their type,
select sum(quantity) from <Table-Name> where startDate >= (start-date) and startDate <= (end-date) and endDate >= (start-date) and endDate <= (end-date)
Hope this helps