Filter table for abundance in at least 20 % of the samples - awk

I have a huge tab-separated table like the one below:
the first row is the subject list, while the other rows are my counts.
KEGGAnnotation a b c d e f g h i l m n o p q r s t u v z w ee wr ty yu im
K01824 0 0 1 5 0 0 0 0 0 0 0 0 0 0 14 6 0 0 0 0 0 0 0 0 0 0 0
K03924 17302 15372 19601 18732 17180 18094 23560 20516 14280 24187 19642 20521 20330 20843 22948 17124 19557 18319 16608 19463 18334 21022 14325 10819 13342 16876 16979
K13730 0 0 1 5 0 0 0 0 0 0 0 0 0 0 14 6 0 0 0 0 0 0 0 0 0 0 0
K13735 5360 463 12516 7235 5051 2022 2499 2778 5392 1220 6460 9490 1169 6556 14862 9657 7360 6837 7810 4368 2186 12474 7810 9755 1401 12867 4431
K07279 0 0 1 5 0 0 0 0 0 0 0 0 0 0 14 6 0 0 0 0 0 0 0 0 0 0 0
K14194 4499 2216 2322 2031 2763 2219 704 1647 2536 876 2692 4196 687 2958 3207 2153 2266 1974 370 2867 1110 5372 3637 9828 2038 2812 3472
K11494 0 0 1 10 0 0 0 0 11 0 0 0 0 0 14 6 0 0 0 0 0 0 0 0 0 0 0
K03332 0 0 1 5 0 0 0 0 0 0 0 0 0 0 14 6 0 0 0 0 0 0 0 0 0 0 0
K01317 3 1 6 0 1 3 0 14 11 0 21 8 0 20 0 263 0 0 6 3 5 0 0 41 0 0 2
I would like to extract only the lines in which counts >100 appear in at least 20% of the samples (i.e. in at least 6 samples).
E.g. K03924 will be extracted, but not K03332.

Increment a counter for each value at or above the threshold, and print the line if the counter reaches 20% of the fields checked. Note that this also prints the header line.
awk '{c=0; for(i=2;i<=NF;i++) c+=($i>=100); if(c>=0.2*(NF-1)) print $0}' input
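For completeness, a pandas version of the same filter can be handy when the table is loaded for downstream analysis. This is a minimal sketch, assuming the table lives in a tab-separated file (the file names counts.tsv and filtered.tsv are made up):
import pandas as pd

# Hypothetical file name; the first column is the KEGG annotation, the rest are sample counts
df = pd.read_csv("counts.tsv", sep="\t", index_col="KEGGAnnotation")

# Keep rows where counts of at least 100 appear in at least 20% of the samples
keep = (df >= 100).sum(axis=1) >= 0.2 * df.shape[1]
df[keep].to_csv("filtered.tsv", sep="\t")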

Related

T-SQL poor CTE join performance

Long story short: I have two similar SQL queries that output pairs of short and long names. The input parameter of the first query is a short name, and of the second a long name. The queries are constructed so that they return all rows containing the input parameter, not just exact matches (for example, if I set the tag name to ST2, the query outputs every object that contains ST2 and every other object whose name starts with ST2).
All the queries are executed in the same database. In the queries below the input parameters are set to '', which means each query outputs the short/long name pairs for all objects.
declare @tagName as varchar(50) = ''
set @tagName = @tagName + '.'
-- this one query outputs ~700 rows
;with AnalogTag as
(
Select *
from [Runtime].[dbo].[AnalogTag]
where (substring(TagName, 0, charindex('.',@tagName))) in (substring(@tagName,0, len(@tagName)))
and (substring(TagName, charindex('.',TagName), 2)) not in ('.#')
),
-- this one query outputs ~7000 rows
HierarchicalName as
(
Select *
from [proba7].[dbo].[internal_list_objects_view]
where substring(tag_name, 0,len(@tagName)) = substring(@tagName,0, len(@tagName))
)
select HierarchicalName.tag_name as TagName
,HierarchicalName.hierarchical_name ilo_view_HierarchicalName
from AnalogTag
inner join HierarchicalName
on substring(AnalogTag.TagName, 0, CHARINDEX('.',AnalogTag.TagName)) = HierarchicalName.tag_name
The whole query above runs in approximately 3 seconds and outputs about 450 rows.
I created a similar query on the same database:
declare @hierarchicalName as varchar(200) = ''
declare @Length as int
set @Length = LEN(@hierarchicalName) + 1
-- this query outputs approx. 700 rows and, if run separately,
-- finishes almost instantly
;with AnalogTag as
(
Select TagName
from [Runtime].[dbo].[AnalogTag]
where (substring(TagName, 0, CHARINDEX('.',TagName))) in
(
Select tag_name from [proba7].[dbo].[internal_list_objects_view]
where substring(hierarchical_name, 0, @Length) = @hierarchicalName
)
and (substring(TagName, CHARINDEX('.',TagName), 2)) not in ('.#')
),
-- this query outputs approx. 7000 rows and, if run separately,
-- finishes almost instantly
HierarchicalName as
(
Select hierarchical_name, tag_name from [proba7].[dbo].[internal_list_objects_view]
where substring(hierarchical_name, 0, @Length) = @hierarchicalName
)
select HierarchicalName.tag_name as ilo_view_TagName
,HierarchicalName.hierarchical_name ilo_view_HierarchicalName
from AnalogTag
inner join HierarchicalName
on substring(AnalogTag.TagName, 0, CHARINDEX('.',AnalogTag.TagName)) = HierarchicalName.tag_name
This time the query runs in 28 seconds and outputs a similar number of rows to the first query (the outputs of the two queries must be similar). I noticed that if I change the inner join to, for example, a full join, the query runs instantly.
Analog tag example output:
TagName a b c d e f g h i j k l m
PomFPTemp.PV 1062 0 10 0 10 0 4 0 0 0 0 0 0
PomFPWilgWzgl.PV 1 0 10 0 10 0 4 0 0 0 0 0 0
SNST3P1BrakFWHMocD3f.PV 1063 0 10 0 10 0 4 0 0 0 0 0 0
SNST3P1BrakFWHMocP3f.PV 46 0 10 0 10 0 4 0 0 0 0 0 0
SNST3P1BrakFWHMocQ3f.PV 1063 0 10 0 10 0 4 0 0 0 0 0 0
SNST3P1BrakFWHMocQ3fIntExp.PV 1063 0 10 0 10 0 4 0 0 0 0 0 0
SNST3P1BrakFWHMocQ3fIntImp.PV 1063 0 10 0 10 0 4 0 0 0 0 0 0
SNST3P1BrakFWHMocQn3f.PV 1063 0 10 0 10 0 4 0 0 0 0 0 0
SNST3P1EnkvarhExp3f.PV 1060 0 10 0 10 0 4 0 0 0 0 0 0
SNST3P1EnkvarhImp3f.PV 1060 0 10 0 10 0 4 0 0 0 0 0 0
hierarchical name example output:
ID tag_name contained_name hierarchical_name a b c d e f g h i j k l m n o p q r s t u v w x y z aa bb cc dd ee ff gg hh
1 $Galaxy $Galaxy 1 0 0 0 0 1 NULL NULL 0 0 1 1 1 0 0 0 0 0 1 0 NULL 23 0 0 0 0 0 0 0 0 0 133123 50026 1 NULL
2 proba7 Galaxy_001 0 0 0 0 1 1 NULL NULL 0 0 848 1 0 1 0 0 0 0 1 0 NULL 23 0 0 0 0 0 0 0 0 0 4098 878020 1 NULL
3 $_AutoImport $_AutoImport 1 0 0 0 0 3 NULL NULL 0 0 4 2 0 0 0 0 0 0 1 0 NULL 10 0 0 0 0 0 0 0 0 0 131075 3699 1 NULL
4 $_DiCommon $_DiCommon 1 0 0 0 0 4 NULL NULL 0 0 5 3 0 0 0 0 0 0 1 0 NULL 10 0 0 0 0 0 0 0 0 0 131075 50023 1 NULL
5 $WinPlatform $WinPlatform 1 0 0 0 0 5 NULL 1 0 0 6 4 1 0 0 0 0 0 0 0 NULL 1 0 0 0 0 0 0 0 0 0 133121 419340 1 1
6 $AppEngine $AppEngine 1 0 0 0 0 6 NULL 1 0 0 7 5 1 0 0 0 0 0 0 0 NULL 3 0 0 0 0 0 0 0 0 0 133121 419341 1 1
7 $Area $Area 1 0 0 0 0 7 NULL 1 0 0 8 6 1 0 0 0 0 0 0 0 NULL 13 0 0 0 0 0 0 0 0 0 133121 3452998 1 1
8 $AnalogDevice $AnalogDevice 1 0 0 0 0 8 NULL 2 0 0 9 7 0 0 0 0 0 0 0 0 NULL 10 0 0 0 0 0 0 0 0 0 131073 419343 1 2
9 $DDESuiteLinkClient $DDESuiteLinkClient 1 0 0 0 0 9 NULL 3 0 0 10 8 1 0 0 0 0 0 0 0 NULL 11 0 0 0 0 0 0 0 0 0 133121 419344 1 3
10 $DiscreteDevice $DiscreteDevice 1 0 0 0 0 10 NULL 2 0 0 11 9 0 0 0 0 0 0 0 0 NULL 10 0 0 0 0 0 0 0 0 0 131073 419345 1 2
11 $InTouchProxy $InTouchProxy 1 0 0 0 0 11 NULL 3 0 0 12 10 0 0 0 0 0 0 0 0 NULL 11 0 0 0 0 0 0 0 0 0 131073 419346 1 3
Output table (example):
ilo_view_TagName ilo_view_HierarchicalName
ST4FP12Rozl ST4_FP1.Galaz_2.Rozladowywanie
ST4FP21Rozl ST4_FP2.Galaz_1.Rozladowywanie
ST4FP22Rozl ST4_FP2.Galaz_2.Rozladowywanie
ST4FP31Rozl ST4_FP3.Galaz_1.Rozladowywanie
ST4RS41AnWspKFL2 ST4_S1_RS4_1.Wsp.K_Factor.L2
ST4FP32Rozl ST4_FP3.Galaz_2.Rozladowywanie
ST4RS31AnWspKFL2 ST4_S2_RS3_1.Wsp.K_Factor.L2
ST4RS51AnWspKFL2 ST4_S3_RS5_1.Wsp.K_Factor.L2
ST4FP11U ST4_FP1.Galaz_1.Napiecie
Best regards and thanks in advance for any advice. I tried my best to make these example tables readable.

Pandas Drop levels and attach to column title

Using the pivot function I have managed to obtain a flattened data frame:
q_id 1 2
a_id 1 2 3 4 5 6 7 8
movie_id user_id start_rating
931 284 2.0 0 0 0 1 0 0 0 0
804 648 4.5 0 0 0 0 1 0 0 0
840 414 4.5 0 1 0 0 0 0 0 0
843 419 3.5 1 0 0 0 0 1 0 0
848 132 3.5 1 0 0 1 0 0 0 0
My goal was to remove the index levels and attach them to the column names:
movie_id user_id start_rating 1_1 1_2 1_3 1_4 2_5 2_6 2_7 2_8
931 284 2.0 0 0 0 1 0 0 0 0
804 648 4.5 0 0 0 0 1 0 0 0
840 414 4.5 0 1 0 0 0 0 0 0
843 419 3.5 1 0 0 0 0 1 0 0
848 132 3.5 1 0 0 1 0 0 0 0
I tried the following:
df.columns = ['_'.join(col).strip() for col in df.columns.values]
but I get:
df.columns = ['_'.join(col).strip() for col in df.columns.values]
TypeError: sequence item 0: expected string, int found
The join function works with strings, and the elements of col are ints, as the error shows. You need to convert the elements of col to str:
df.columns = ['_'.join([str(lev) for lev in col]).strip() for col in df.columns.values]
or, because here you have exactly two levels:
df.columns = ['{}_{}'.format(l1,l2) for l1, l2 in df.columns.values]
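To see the flattening in isolation, here is a small self-contained sketch with a made-up two-level column index in the same spirit as the pivoted frame above (the q_id/a_id tuples are invented for illustration):
import pandas as pd

# Made-up two-level columns mimicking the q_id / a_id levels
cols = pd.MultiIndex.from_tuples([(1, 1), (1, 2), (2, 5)], names=["q_id", "a_id"])
df = pd.DataFrame([[0, 1, 0], [1, 0, 1]], columns=cols)

# Cast each level to str before joining, as in the answer above
df.columns = ["_".join(str(lev) for lev in col) for col in df.columns]
print(df.columns.tolist())  # ['1_1', '1_2', '2_5']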

How to turn a list of events into a matrix to display in Pandas

I have a list of events and I want to display on a graph how many happen per hour on each day of the week, as shown below:
Example of the graph I want
(each line is a day, the x-axis is the time of day, the y-axis is the number of events)
As I am new to Pandas I am not sure of the best way to do it, but here is my attempt:
x = [(rts[k].getDay(), rts[k].getHour(), 1) for k in rts]
df = pd.DataFrame(x[:30]) # Subset of 30 events
dfGrouped = df.groupby([0, 1]).sum() # Group them by day and hour
#Format to display
pd.DataFrame(np.random.randn(24, 7), index=range(0,24), columns=['Mo', 'Tu', 'We', 'Th', 'Fr', 'Sa', 'Su'])
The question is: how can I go from my grouped dataframe to the 24x7 matrix required for display?
I tried as_matrix, but that gives me only a one-dimensional array, while I want the index of my dataframe to become the index of my matrix.
print(df)
2
0 1
0 19 1
23 1
1 10 2
18 3
22 1
2 17 1
3 8 2
9 3
11 3
13 1
19 1
4 7 1
9 1
14 1
15 1
18 1
5 1 2
7 1
13 1
19 1
6 12 1
Thanks for your help :)
Antoine
I think you need unstack to reshape the data, then rename the column names with a dict and, if necessary, add the missing hours to the index with reindex_axis:
df1 = df.groupby([0, 1])[2].sum().unstack(0, fill_value=0)
#set columns names
df = pd.DataFrame(x[:30], columns = ['days','hours','val'])
d = {0: 'Mo', 1: 'Tu', 2: 'We', 3: 'Th', 4: 'Fr', 5: 'Sa', 6: 'Su'}
df1 = df.groupby(['days', 'hours'])['val'].sum().unstack(0, fill_value=0)
df1 = df1.rename(columns=d).reindex_axis(range(24), fill_value=0)
print (df1)
days Mo Tu We Th Fr Sa Su
hours
0 0 0 0 0 0 0 0
1 0 0 0 0 0 2 0
2 0 0 0 0 0 0 0
3 0 0 0 0 0 0 0
4 0 0 0 0 0 0 0
5 0 0 0 0 0 0 0
6 0 0 0 0 0 0 0
7 0 0 0 0 1 1 0
8 0 0 0 2 0 0 0
9 0 0 0 3 1 0 0
10 0 2 0 0 0 0 0
11 0 0 0 3 0 0 0
12 0 0 0 0 0 0 1
13 0 0 0 1 0 1 0
14 0 0 0 0 1 0 0
15 0 0 0 0 1 0 0
16 0 0 0 0 0 0 0
17 0 0 1 0 0 0 0
18 0 3 0 0 1 0 0
19 1 0 0 1 0 1 0
20 0 0 0 0 0 0 0
21 0 0 0 0 0 0 0
22 0 1 0 0 0 0 0
23 1 0 0 0 0 0 0
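With df1 in that shape, the graph the question asks for (one line per day, hour of day on the x-axis) is a direct plot of the frame. A minimal sketch, assuming matplotlib is installed:
import matplotlib.pyplot as plt

ax = df1.plot()  # one line per weekday column, hours on the x-axis
ax.set_xlabel("hour of day")
ax.set_ylabel("number of events")
plt.show()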

How to combine rows or data into one row

I am working on a project to see how many units of each category customers order, and this is my SELECT clause:
SELECT
d2.customer_id
, ( CASE WHEN d2.category = 100 THEN d2.units ELSE 0 END ) AS produce_units
, ( CASE WHEN d2.category = 200 THEN d2.units ELSE 0 END ) AS meat_units
, ( CASE WHEN d2.category = 300 THEN d2.units ELSE 0 END ) AS seafood_units
, SUM (d2.units) AS total_units
And my result looks like this, where 62779 is the customer id and the last column is total units:
62779 0 0 0 0 20 0 0 0 0 0 0 20
62779 0 0 0 0 0 0 0 0 52 0 0 52
62779 0 6 0 0 0 0 0 0 0 0 0 6
62779 0 0 0 0 0 0 0 0 0 22 0 22
62779 0 0 0 0 0 14 0 0 0 0 0 14
62779 0 0 0 0 0 0 0 20 0 0 0 20
62779 0 0 0 8 0 0 0 0 0 0 0 8
62779 64 0 0 0 0 0 0 0 0 0 0 64
However, I want my result to look like this:
62779 64 6 0 8 20 14 0 20 52 22 0 206
Please advise. Thanks :)
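The missing piece is conditional aggregation: wrapping each CASE expression in SUM() and adding GROUP BY d2.customer_id collapses the per-order rows into one row per customer. For comparison, here is a minimal pandas sketch of the same reshape, using made-up customer_id / category / units data (only three of the categories are shown):
import pandas as pd

# Made-up long-format order lines, one row per order line
orders = pd.DataFrame({
    "customer_id": [62779, 62779, 62779, 62779],
    "category":    [100, 200, 300, 100],
    "units":       [44, 6, 52, 20],
})

# Sum units per category per customer, then add a row total
wide = orders.pivot_table(index="customer_id", columns="category",
                          values="units", aggfunc="sum", fill_value=0)
wide["total_units"] = wide.sum(axis=1)
print(wide)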

Pivot table for datetime

I have the following table for pivoting.
Example:
Table:
create table testing
(
column_date datetime
)
Insertion of records:
insert into testing values('2014-11-07'),('2014-11-08'),
('2014-11-01'),('2014-11-02'),('2014-11-04');
Expected Result:
column_date 01 02 03 04 05 06 07 08 09 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31
----------------------------------------------------------------------------------------------------------
2014-11-07 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
2014-11-08 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
2014-11-01 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
2014-11-02 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
2014-11-04 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
Attempt:
select a.column_date,[01],[02],[03],[04],[05],[06],[07],[08],[09],[10],
[11],[12],[13],[14],[15],[16],[17],[18],[19],[20],
[21],[22],[23],[24],[25],[26],[27],[28],[29],[30],[31]
from
(
select column_date from testing
) as a
pivot
(
count(column_date)
for column_date in([01],[02],[03],[04],[05],[06],[07],[08],[09],[10],
[11],[12],[13],[14],[15],[16],[17],[18],[19],[20],
[21],[22],[23],[24],[25],[26],[27],[28],[29],[30],[31])
) pvt;
Error details:
Msg 8114, Level 16, State 1, Line 11
Error converting data type nvarchar to date.
Msg 473, Level 16, State 1, Line 11
The incorrect value "01" is supplied in the PIVOT operator.
Msg 4104, Level 16, State 1, Line 1
The multi-part identifier "a.column_date" could not be bound.
Try this. You need to use DATEPART to get the day of the month in the SELECT query that produces the data, and pivot on that value to get the result.
SELECT column_date,
[01],[02],[03],[04],[05],[06],[07],[08],[09],
[10],[11],[12],[13],[14],[15],[16],[17],[18],
[19],[20],[21],[22],[23],[24],[25],[26],[27],
[28],[29],[30],[31]
FROM (SELECT column_date,
Datepart(dd, column_date) dd,
column_date AS ddate
FROM testing) AS a
PIVOT ( Count(ddate)
FOR dd IN( [01],[02],[03],[04],[05],[06],[07],[08],[09],
[10],[11],[12],[13],[14],[15],[16],[17],[18],
[19],[20],[21],[22],[23],[24],[25],[26],[27],
[28],[29],[30],[31]) ) pvt;
OUTPUT
column_date 01 02 03 04 05 06 07 08 09 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31
---------------------------------------------------------------------------------------------------------------------------------------------------
2014-11-01 00:00:00.000 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
2014-11-02 00:00:00.000 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
2014-11-04 00:00:00.000 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
2014-11-07 00:00:00.000 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
2014-11-08 00:00:00.000 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
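For readers doing the same reshape in pandas rather than T-SQL, the equivalent is a crosstab of the date against its day of the month. A minimal sketch, reusing the sample dates from the question:
import pandas as pd

dates = pd.to_datetime(["2014-11-07", "2014-11-08", "2014-11-01",
                        "2014-11-02", "2014-11-04"])
df = pd.DataFrame({"column_date": dates})

# One row per date, one column per day of the month, 1 where the day matches
pivoted = (pd.crosstab(df["column_date"], df["column_date"].dt.day)
             .reindex(columns=range(1, 32), fill_value=0))
print(pivoted)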