Conditional grouping in pandas and transpose - dataframe

With an input dataframe built from a given CSV, I need to transpose the data based on certain conditions. The groupby should be applied based on the 'Key' value.
For any value in the same 'Key' group, if the 'Type' is "T", the value should be written to "T" columns labelled T1, T2, T3, and so on.
For any value in the same 'Key' group, if the 'Type' is "P" and 'Code' ends with "00", the value should be written to "U" columns labelled U1, U2, U3, and so on.
For any value in the same 'Key' group, if the 'Type' is "P" and 'Code' doesn't end with "00", the value should be written to "P" columns labelled P1, P2, P3, and so on.
There may be any number of values of type T and P for any 'Key' value, and the output columns for T and P should expand accordingly.
Input Dataframe:
df = pd.DataFrame({'Key': ['1', '1', '1', '1', '1', '2', '2', '2', '2', '2'],
                   'Value': ['T101', 'T102', 'P101', 'P102', 'P103', 'T201', 'T202', 'P201', 'P202', 'P203'],
                   'Type': ['T', 'T', 'P', 'P', 'P', 'T', 'T', 'P', 'P', 'P'],
                   'Code': ['0', '0', 'ABC00', 'TWY01', 'JTH02', '0', '0', 'OUJ00', 'LKE00', 'WDF45']})
Expected Dataframe: (shown as an image in the original post)
Can anyone suggest an effective solution for this case?

Here's a possible solution using pivot.
import pandas as pd
df = pd.DataFrame({'Key': ['1', '1', '1', '1', '1', '2', '2', '2', '2', '2'],
                   'Value': ['T101', 'T102', 'P101', 'P102', 'P103', 'T201', 'T202', 'P201', 'P202', 'P203'],
                   'Type': ['T', 'T', 'P', 'P', 'P', 'T', 'T', 'P', 'P', 'P'],
                   'Code': ['0', '0', 'ABC00', 'TWY01', 'JTH02', '0', '0', 'OUJ00', 'LKE00', 'WDF45']})
# Relabel P rows whose Code ends with '00' as type 'U'
df.loc[(df['Type'] == 'P') & df['Code'].str.endswith('00'), 'Type'] = 'U'
# Number the values within each (Key, Type) group: 1, 2, 3, ...
df = df.join((df.groupby(['Key', 'Type']).cumcount() + 1).rename('Tcount'))
df['Type'] = df['Type'] + df['Tcount'].astype(str)
# Pivot the table
pv = df.loc[:, ['Key', 'Type', 'Value']].pivot(index='Key', columns='Type', values='Value')
>>> pv
Type    P1    P2    T1    T2    U1    U2
Key
1     P102  P103  T101  T102  P101   NaN
2     P203   NaN  T201  T202  P201  P202
cdf = df.loc[df['Code'] != '0', ['Key', 'Code']].groupby('Key')['Code'].apply(lambda x: ','.join(x))
>>> cdf
Key
1    ABC00,TWY01,JTH02
2    OUJ00,LKE00,WDF45
Name: Code, dtype: object
>>> pv.join(cdf)
       P1    P2    T1    T2    U1    U2               Code
Key
1    P102  P103  T101  T102  P101  None  ABC00,TWY01,JTH02
2    P203  None  T201  T202  P201  P202  OUJ00,LKE00,WDF45
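For reuse, the whole transformation can be wrapped into one function; a minimal sketch, assuming the input always has the Key/Value/Type/Code columns shown above:
import pandas as pd

def transpose_by_key(df):
    # Work on a copy so the caller's frame is untouched
    out = df.copy()
    # Relabel P rows whose Code ends with '00' as type 'U'
    out.loc[(out['Type'] == 'P') & out['Code'].str.endswith('00'), 'Type'] = 'U'
    # Number values within each (Key, Type) group: T1, T2, ..., U1, ..., P1, ...
    out['Type'] = out['Type'] + (out.groupby(['Key', 'Type']).cumcount() + 1).astype(str)
    pv = out.pivot(index='Key', columns='Type', values='Value')
    # Collect the non-placeholder codes per Key into one comma-separated column
    codes = out.loc[out['Code'] != '0'].groupby('Key')['Code'].agg(','.join)
    return pv.join(codes)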

Related

How can I print a list and an int next to each other

I'm trying to print the 'word results' and the 'number results' next to each other without spaces, but unfortunately everything I've tried hasn't worked and it only prints them out vertically.
import random

user_Input = input('Strong Or Weak?: ')
wrds = ['p', 'e', 'T', 'U', 'S', 'C', 'v', 'Q', 't', 'V', 'I', 'R', 'K', 'A', 'G', 'l', 'r', 'u', 'b', 'P', 'p', 'n', 'H', 'i', 'R', 'I', 'w', 'K', 'v', 'F', 'J', 'y', 'B', 'h', 'o', 'a', 'G', 'X', 'z']
rndm_num = random.randint(9999, 99999)
rndm_wrds = random.sample(wrds, k=8)
result_wrds = rndm_wrds
result_num = rndm_num
if user_Input == 'Strong' or 'strong':
    print(*result_wrds, sep=''), print(result_num)
If you want the results on the same line, you can use a single print() statement. (Note that user_Input == 'Strong' or 'strong' is always truthy; comparing the lowercased input is probably what was intended.)
if user_Input.lower() == 'strong':
    print(*result_wrds, result_num, sep='')
# CunvvzpI52080
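An equivalent sketch that builds the combined string first with str.join and an f-string (the variable name combined is illustrative):
# Build the string explicitly instead of relying on print's sep argument
combined = f"{''.join(result_wrds)}{result_num}"
print(combined)  # e.g. CunvvzpI52080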

How to apply str.split() on pandas column?

Using Simple Data:
df = pd.DataFrame({'ids': [0,1,2], 'value': ['2 4 10 0 14', '5 91 19 20 0', '1 1 1 2 44']})
I need to convert the column to array, so I use:
df.iloc[:,-1] = df.iloc[:,-1].apply(lambda x: str(x).split())
X = df.iloc[:, 1:]
X = np.array(X.values)
but the problem is that the data ends up nested (a column of lists), and I just need a (3, 5) matrix. How can I do this properly and fast for large data (avoiding loops)?
As said in the comments by @anky and @ScottBoston, you can use the string method split along with the expand parameter and finally convert to NumPy:
df.iloc[:, 1].str.split(expand=True).values
array([['2', '4', '10', '0', '14'],
       ['5', '91', '19', '20', '0'],
       ['1', '1', '1', '2', '44']], dtype=object)
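If the downstream code needs numbers rather than strings (an assumption about the use case), the same expression can cast before taking the array:
import pandas as pd

df = pd.DataFrame({'ids': [0, 1, 2], 'value': ['2 4 10 0 14', '5 91 19 20 0', '1 1 1 2 44']})

# Split into columns, cast the strings to int, then take the NumPy array
X = df['value'].str.split(expand=True).astype(int).to_numpy()
print(X.shape)  # (3, 5)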

Combine index header row and column header row in Pandas

I create a dataframe and export it to an HTML table. However, the headers are split across two rows: the column names sit on one row and the index names on the row below (the desired and actual renderings were shown as images in the original post).
How can I combine the index-name row and the column-name row?
I create the dataframe as below (example):
data = [{'Name': 'A', 'status': 'ok', 'host': '1', 'time1': '2020-01-06 06:31:06', 'time2': '2020-02-06 21:10:00'},
        {'Name': 'A', 'status': 'ok', 'host': '2', 'time1': '2020-01-06 06:31:06', 'time2': '-'},
        {'Name': 'B', 'status': 'Alert', 'host': '1', 'time1': '2020-01-06 10:31:06', 'time2': '2020-02-06 21:10:00'},
        {'Name': 'B', 'status': 'ok', 'host': '2', 'time1': '2020-01-06 10:31:06', 'time2': '2020-02-06 21:10:00'},
        {'Name': 'B', 'status': 'ok', 'host': '4', 'time1': '2020-01-06 10:31:06', 'time2': '2020-02-06 21:10:00'},
        {'Name': 'C', 'status': 'Alert', 'host': '2', 'time1': '2020-01-06 10:31:06', 'time2': '2020-02-06 21:10:00'},
        {'Name': 'C', 'status': 'ok', 'host': '3', 'time1': '2020-01-06 10:31:06', 'time2': '2020-02-06 21:10:00'},
        {'Name': 'C', 'status': 'ok', 'host': '4', 'time1': '-', 'time2': '-'}]
df = pandas.DataFrame(data)
df.set_index(['Name', 'status', 'host'], inplace=True)
html_body = df.to_html(bold_rows=False)
The index is set to have hierarchical rows, for easier reading in an html table:
print(df)
                               time1                time2
Name status host
A    ok     1    2020-01-06 06:31:06  2020-02-06 21:10:00
            2    2020-01-06 06:31:06                    -
B    Alert  1    2020-01-06 10:31:06  2020-02-06 21:10:00
     ok     2    2020-01-06 10:31:06  2020-02-06 21:10:00
            4    2020-01-06 10:31:06  2020-02-06 21:10:00
C    Alert  2    2020-01-06 10:31:06  2020-02-06 21:10:00
     ok     3    2020-01-06 10:31:06  2020-02-06 21:10:00
            4                      -                    -
The only solution I've got working is to set every column as the index. This doesn't seem practical, though, and it leaves an empty row that must be removed manually.
Setup
import pandas as pd
from IPython.display import HTML
l0 = ('Foo', 'Bar')
l1 = ('One', 'Two')
ix = pd.MultiIndex.from_product([l0, l1], names=('L0', 'L1'))
df = pd.DataFrame(1, ix, [*'WXYZ'])
HTML(df.to_html())
BeautifulSoup
Hack the HTML result from df.to_html(header=False). Pluck out the empty cells in the table head and drop in the column names.
from bs4 import BeautifulSoup

html_doc = df.to_html(header=False)
soup = BeautifulSoup(html_doc, 'html.parser')
empty_cols = soup.find('thead').find_all(lambda tag: not tag.contents)

for tag, col in zip(empty_cols, df):
    tag.string = col

HTML(soup.decode_contents())
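If repeating the group labels on every row is acceptable, a simpler alternative (a sketch, not the approach above) is to flatten the index back into ordinary columns before exporting; to_html then emits a single header row with no post-processing:
# Flatten the MultiIndex into regular columns, giving one header row
HTML(df.reset_index().to_html(index=False))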
If you want to use a Dataframe Styler to perform a lot of wonderful formatting on your table, the elements, and the contents, then you might need a slight change to piRSquared's answer, as I did.
(Screenshot of the table before the transformation omitted.)
style.to_html() added non-breaking spaces, which made tag.contents always truthy and thus yielded no change to the table. I modified the lambda to account for this, which revealed another issue:
lambda tag: (not tag.contents) or '\xa0' in tag.contents
Cells were copied strangely
Styler.to_html() lacks the header kwarg; I am guessing this is the source of the issue. I took a slightly different approach: move the second-row headers into the first row, then destroy the second header row.
It seems pretty generic and reusable for any multi-indexed dataframe.
df_styler = summary_df.style
# Use the df_styler to change display format, color, alignment, etc.
raw_html = df_styler.to_html()

soup = BeautifulSoup(raw_html, 'html.parser')
head = soup.find('thead')
trs = head.find_all('tr')

# First header row: the blank cells (empty, or containing only a non-breaking space)
ths0 = trs[0].find_all(lambda tag: (not tag.contents) or '\xa0' in tag.contents)
# Second header row: the filled index-name cells
ths1 = trs[1].find_all(lambda tag: (tag.contents) or '\xa0' not in tag.contents)

for blank, filled in zip(ths0, ths1):
    blank.replace_with(filled)

# The second header row is now redundant; remove it
trs[1].decompose()
final_html_str = soup.decode_contents()
Success - two header rows condensed into one
Big thanks to piRSquared for the BeautifulSoup starting point!

How to write a Kusto query to select only the rows that have unique values in one field

Having this input:
let t1 = datatable(id:string, col1:string, col2:string)
[
'1', 'col1_1', 'col2_1',
'2', 'col1_2', 'col2_2',
'3', 'col1_3', 'col2_3',
'4', 'col1_4', 'col2_4',
'1', 'col1_1', 'col2_11',
];
t1
| distinct id, col1
I need a query that will select only rows with unique values in the "id" field. I understand that there are two possible outputs:
Output 1:
'1', 'col1_1', 'col2_1',
'2', 'col1_2', 'col2_2',
'3', 'col1_3', 'col2_3',
'4', 'col1_4', 'col2_4',
Output 2:
'2', 'col1_2', 'col2_2',
'3', 'col1_3', 'col2_3',
'4', 'col1_4', 'col2_4',
'1', 'col1_1', 'col2_11',
You can make use of the any() aggregate function (renamed take_any() in newer versions of Kusto) to pick the col1 and col2 values based on unique values in the 'id' column.
let t1 = datatable(id:string, col1:string, col2:string)
[
'1', 'col1_1', 'col2_1',
'2', 'col1_2', 'col2_2',
'3', 'col1_3', 'col2_3',
'4', 'col1_4', 'col2_4',
'1', 'col1_1', 'col2_11',
];
t1
| summarize any(col1), any(col2) by id
Would this work for your needs?
let t1 = datatable(id:string, col1:string, col2:string)
[
'1', 'col1_1', 'col2_1',
'2', 'col1_2', 'col2_2',
'3', 'col1_3', 'col2_3',
'4', 'col1_4', 'col2_4',
'1', 'col1_1', 'col2_11',
];
t1
| summarize col1 = make_set( col1 ), col2 = make_set( col2 ) by id

How to avoid sub-queries to gain performance

I have a reporting query which has two long sub-queries:
SELECT r1.code_centre, r1.libelle_centre, r1.id_equipe, r1.equipe, r1.id_file_attente,
r1.libelle_file_attente,r1.id_date, r1.tranche, r1.id_granularite_de_periode,r1.granularite,
r1.ContactsTraites, r1.ContactsenParcage, r1.ContactsenComm, r1.DureeTraitementContacts,
r1.DureeComm, r1.DureeParcage, r2.AgentsConnectes, r2.DureeConnexion, r2.DureeTraitementAgents,
r2.DureePostTraitement
FROM
( SELECT cc.id_centre_contact, cc.code_centre, cc.libelle_centre, a.id_equipe, a.equipe,
a.id_file_attente, f.libelle_file_attente, a.id_date, g.tranche, g.id_granularite_de_periode,
g.granularite, sum(Nb_Contacts_Traites) as ContactsTraites,
sum(Nb_Contacts_en_Parcage) as ContactsenParcage,
sum(Nb_Contacts_en_Communication) as ContactsenComm,
sum(Duree_Traitement/1000) as DureeTraitementContacts,
sum(Duree_Communication / 1000 + Duree_Conference / 1000 + Duree_Com_Interagent / 1000) as DureeComm,
sum(Duree_Parcage/1000) as DureeParcage
FROM agr_synthese_activite_media_fa_agent a, centre_contact cc,
direction_contact dc, granularite_de_periode g, media m, file_attente f
WHERE m.id_media = a.id_media
AND cc.id_centre_contact = a.id_centre_contact
AND a.id_direction_contact = dc.id_direction_contact
AND dc.direction_contact ='INCOMING'
AND a.id_file_attente = f.id_file_attente
AND m.media = 'PHONE'
AND ( ( g.valeur_min = date_format(a.id_date,'%d/%m') and g.granularite = 'Jour')
or ( g.granularite = 'Heure' and a.id_th_heure = g.id_granularite_de_periode) )
GROUP by cc.id_centre_contact, a.id_equipe, a.id_file_attente, a.id_date, g.tranche,
g.id_granularite_de_periode) r1,
(
(SELECT cc.id_centre_contact,cc.code_centre, cc.libelle_centre, a.id_equipe, a.equipe,
a.id_date, g.tranche, g.id_granularite_de_periode,g.granularite,
count(distinct a.id_agent) as AgentsConnectes,
sum(Duree_Connexion / 1000) as DureeConnexion,
sum(Duree_en_Traitement / 1000) as DureeTraitementAgents,
sum(Duree_en_PostTraitement / 1000) as DureePostTraitement
FROM activite_agent a, centre_contact cc, granularite_de_periode g
WHERE ( g.valeur_min = date_format(a.id_date,'%d/%m') and g.granularite = 'Jour')
AND cc.id_centre_contact = a.id_centre_contact
GROUP BY cc.id_centre_contact, a.id_equipe, a.id_date, g.tranche, g.id_granularite_de_periode )
UNION
(SELECT cc.id_centre_contact,cc.code_centre, cc.libelle_centre, a.id_equipe, a.equipe,
a.id_date, g.tranche, g.id_granularite_de_periode,g.granularite,
count(distinct a.id_agent) as AgentsConnectes,
sum(Duree_Connexion / 1000) as DureeConnexion,
sum(Duree_en_Traitement / 1000) as DureeTraitementAgents,
sum(Duree_en_PostTraitement / 1000) as DureePostTraitement
FROM activite_agent a, centre_contact cc, granularite_de_periode g
WHERE ( g.granularite = 'Heure'
AND a.id_th_heure = g.id_granularite_de_periode)
AND cc.id_centre_contact = a.id_centre_contact
GROUP BY cc.id_centre_contact,a.id_equipe, a.id_date, g.tranche, g.id_granularite_de_periode)
) r2
WHERE r1.id_centre_contact = r2.id_centre_contact
AND r1.id_equipe = r2.id_equipe AND r1.id_date = r2.id_date
AND r1.tranche = r2.tranche AND r1.id_granularite_de_periode = r2.id_granularite_de_periode
GROUP BY r1.id_centre_contact , r1.id_equipe, r1.id_file_attente,
r1.id_date, r1.tranche, r1.id_granularite_de_periode
ORDER BY r1.code_centre, r1.libelle_centre, r1.equipe,
r1.libelle_file_attente, r1.id_date, r1.id_granularite_de_periode,r1.tranche
The EXPLAIN output shows:
| id | select_type | table | type| possible_keys | key | key_len | ref| rows | Extra |
'1', 'PRIMARY', '<derived3>', 'ALL', NULL, NULL, NULL, NULL, '2520', 'Using temporary; Using filesort'
'1', 'PRIMARY', '<derived2>', 'ALL', NULL, NULL, NULL, NULL, '4378', 'Using where; Using join buffer'
'3', 'DERIVED', 'a', 'ALL', 'fk_Activite_Agent_centre_contact', NULL, NULL, NULL, '83433', 'Using temporary; Using filesort'
'3', 'DERIVED', 'g', 'ref', 'Index_granularite,Index_Valeur_min', 'Index_Valeur_min', '23', 'func', '1', 'Using where'
'3', 'DERIVED', 'cc', 'ALL', 'PRIMARY', NULL, NULL, NULL, '6', 'Using where; Using join buffer'
'4', 'UNION', 'g', 'ref', 'PRIMARY,Index_granularite', 'Index_granularite', '23', '', '24', 'Using where; Using temporary; Using filesort'
'4', 'UNION', 'a', 'ref', 'fk_Activite_Agent_centre_contact,fk_activite_agent_TH_heure', 'fk_activite_agent_TH_heure', '5', 'reporting_acd.g.Id_Granularite_de_periode', '2979', 'Using where'
'4', 'UNION', 'cc', 'ALL', 'PRIMARY', NULL, NULL, NULL, '6', 'Using where; Using join buffer'
NULL, 'UNION RESULT', '<union3,4>', 'ALL', NULL, NULL, NULL, NULL, NULL, ''
'2', 'DERIVED', 'g', 'range', 'PRIMARY,Index_granularite,Index_Valeur_min', 'Index_granularite', '23', NULL, '389', 'Using where; Using temporary; Using filesort'
'2', 'DERIVED', 'a', 'ALL', 'fk_agr_synthese_activite_media_fa_agent_centre_contact,fk_agr_synthese_activite_media_fa_agent_direction_contact,fk_agr_synthese_activite_media_fa_agent_file_attente,fk_agr_synthese_activite_media_fa_agent_media,fk_agr_synthese_activite_media_fa_agent_th_heure', NULL, NULL, NULL, '20903', 'Using where; Using join buffer'
'2', 'DERIVED', 'cc', 'eq_ref', 'PRIMARY', 'PRIMARY', '4', 'reporting_acd.a.Id_Centre_Contact', '1', ''
'2', 'DERIVED', 'f', 'eq_ref', 'PRIMARY', 'PRIMARY', '4', 'reporting_acd.a.Id_File_Attente', '1', ''
'2', 'DERIVED', 'dc', 'eq_ref', 'PRIMARY', 'PRIMARY', '4', 'reporting_acd.a.Id_Direction_Contact', '1', 'Using where'
'2', 'DERIVED', 'm', 'eq_ref', 'PRIMARY', 'PRIMARY', '4', 'reporting_acd.a.Id_Media', '1', 'Using where'
I don't understand the output completely, but I think the problem is that it does full table scans. I changed all the sub-queries to views (CREATE VIEW ... AS SELECT sub-query), and the result is the same. Thanks for any advice.
A sub-query and a view will most of the time give you the same speed.
If your sub-query does not change between runs, consider creating a table with the same structure as your view, and occasionally do:
truncate table my_table;
insert into my_table select * from my_view;
...to cache your sub-query data. If the table is properly indexed, the time lost storing the sub-query results is marginal, as long as the data doesn't change too frequently, or at least you don't need up-to-date information on a second-to-second basis.