Gather several columns using a value as the number of new rows - sql

I want to convert a somewhat dirty table into a normalized one. The structure of the table is as follows:
CREATE TABLE dirty_table(
date DATE NOT NULL
,name VARCHAR(24) NOT NULL
,co BIT NOT NULL
,en BIT NOT NULL
,re BIT NOT NULL
,po BIT NOT NULL
,ga BIT NOT NULL
,pr BIT NOT NULL
,bi INTEGER NOT NULL
);
This is somewhat similar to this question, but with a caveat: instead of true/false values I have bit and integer columns. The bit columns can contain only 0 and 1, while the bi column can contain 0 or any positive number. For every non-zero column I want to create a new row that keeps the date and name columns and adds the name of that column. Something like this:
date |name |proc |
-----------|----------|-----|
2017-07-04 |Jonny doe |bi |
2017-07-04 |Jonny doe |bi |
2017-07-07 |Jonny doe |ga |
2017-07-04 |Jonny doe |po |
2017-07-04 |Jonda doe |en |
2017-07-04 |Jonda doe |co |
2017-07-07 |Jonda doe |re |
2017-07-07 |Jonda doe |re |
2017-08-03 |Jonda doe |re |
2017-08-08 |Josep doe |en |
2017-08-09 |Josep doe |bi |
2017-08-11 |Josep doe |ga |
As can be seen, the bi column can appear several times if the value is >1. Others, unless there's another row, are likely to have only one combination of date, name and proc column, as seen in this excerpt of dirty_table:
date name co en re po ga pr bi
2017-07-03 DPSUW 1 1 0 0 0 0 2
2017-07-03 XDUPT 1 0 0 0 0 0 0
2017-07-03 XIYUD 0 1 0 0 0 0 1
2017-07-03 HBJRL 1 1 0 0 0 0 2
2017-07-03 DIHMP 1 1 0 0 0 0 1
2017-07-04 MTHDT 1 1 0 0 0 0 2
2017-07-04 MFPLI 0 1 0 0 0 0 1
2017-07-04 GKHFG 1 0 0 0 0 0 1
2017-07-04 QKDNE 1 1 0 0 0 0 2
2017-07-04 GSXLN 1 1 0 0 0 0 2
2017-07-05 ICKUT 0 1 0 0 0 0 1
2017-07-05 NHVLT 0 1 0 0 0 0 1
2017-07-05 KTSFX 1 1 0 0 0 0 1
2017-07-05 AINSA 1 1 0 0 0 0 2
2017-07-07 YUCAU 0 1 0 0 0 0 1
2017-07-07 YLLVX 1 0 0 0 0 0 1
2017-07-10 CSIMK 1 1 0 0 0 0 2
2017-07-10 PWNCV 0 1 0 0 0 0 1
2017-07-10 AMMVX 0 1 0 0 0 0 1
2017-07-11 BLELT 0 1 0 0 0 0 1
2017-07-11 ONAKD 0 1 0 0 0 0 1
2017-07-11 IGJDK 1 0 0 0 0 0 1
2017-07-11 TOQLH 1 1 0 0 0 0 2
2017-07-11 DUQWM 1 0 0 0 0 0 0
2017-07-11 SFWVP 1 1 0 0 0 0 2
2017-07-12 MQVHW 0 1 0 0 0 0 1
2017-07-12 OFHWQ 0 1 0 0 0 0 1
2017-07-12 MPOAK 1 1 0 0 0 0 1
2017-07-12 YPFEH 1 1 0 0 0 0 1
2017-07-12 XUENE 1 0 0 0 0 0 1
I was trying to use case statements but that only creates a single row. How can I create multiple rows from one record using the value as number of new rows to create? I prefer using generic SQL, but I'm using MariaDB.

The simplest method is probably union all:
select date, name, 'co' as proc from t where co >= 1 union all
select date, name, 'en' as proc from t where en >= 1 union all
. . .
select date, name, 'bi' as proc from t where bi >= 1 union all
select date, name, 'bi' as proc from t where bi >= 2;
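The last two selects produce the multiple rows for bi. If bi can exceed 2, add one more select per additional value (bi >= 3, and so on).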
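If bi has no small upper bound, one alternative to writing a select per value is to join against a table of integers. The sketch below is only an illustration: it runs the query in SQLite through Python's sqlite3 module rather than in MariaDB, the `numbers` helper table is hypothetical (it is not part of the question), and only the co, en and bi columns are unpivoted; re, po, ga and pr would follow the same pattern as co and en.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dirty_table(
    date TEXT NOT NULL, name TEXT NOT NULL,
    co INTEGER NOT NULL, en INTEGER NOT NULL, re INTEGER NOT NULL,
    po INTEGER NOT NULL, ga INTEGER NOT NULL, pr INTEGER NOT NULL,
    bi INTEGER NOT NULL
);
-- Three rows taken from the excerpt of dirty_table in the question.
INSERT INTO dirty_table VALUES
    ('2017-07-03', 'DPSUW', 1, 1, 0, 0, 0, 0, 2),
    ('2017-07-03', 'XDUPT', 1, 0, 0, 0, 0, 0, 0),
    ('2017-07-03', 'XIYUD', 0, 1, 0, 0, 0, 0, 1);
-- Hypothetical helper table: integers 1..N with N >= max(bi).
CREATE TABLE numbers(n INTEGER PRIMARY KEY);
INSERT INTO numbers(n) VALUES (1), (2), (3), (4), (5);
""")

query = """
SELECT date, name, 'co' AS proc FROM dirty_table WHERE co >= 1
UNION ALL
SELECT date, name, 'en' AS proc FROM dirty_table WHERE en >= 1
UNION ALL
-- Each row is repeated once for every n with n <= bi, i.e. bi times.
SELECT d.date, d.name, 'bi' AS proc
FROM dirty_table AS d
JOIN numbers AS num ON num.n <= d.bi
ORDER BY date, name, proc
"""
rows = conn.execute(query).fetchall()
for row in rows:
    print(row)
```

The join condition `num.n <= d.bi` is what turns the value of bi into a row count; rows with bi = 0 match nothing and therefore produce no 'bi' row.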

Related

Getting dummy values across all columns

The get_dummies method does not seem to work as expected when used with more than one column.
For example, if I have this dataframe:
shopping_list = [
["Apple", "Bread", "Fridge"],
["Rice", "Bread", "Milk"],
["Apple", "Rice", "Bread", "Milk"],
["Rice", "Milk"],
["Apple", "Bread", "Milk"],
]
df = pd.DataFrame(shopping_list)
If I use the get_dummies method, the items are repeated across columns like this:
pd.get_dummies(df)
0_Apple 0_Rice 1_Bread 1_Milk 1_Rice 2_Bread 2_Fridge 2_Milk 3_Milk
0 1 0 1 0 0 0 1 0 0
1 0 1 1 0 0 0 0 1 0
2 1 0 0 0 1 1 0 0 1
3 0 1 0 1 0 0 0 0 0
4 1 0 1 0 0 0 0 1 0
While the expected result is:
Apple Bread Fridge Milk Rice
0 1 1 1 0 0
1 0 1 0 1 1
2 1 1 0 1 1
3 0 0 0 1 1
4 1 1 0 1 0
Pass the prefix and prefix_sep parameters to get_dummies so the column names are just the item names, then aggregate the resulting duplicated column names with max:
df = pd.get_dummies(df, prefix='', prefix_sep='').max(axis=1, level=0)
print(df)
Apple Rice Bread Milk Fridge
0 1 0 1 0 1
1 0 1 1 1 0
2 1 1 1 1 0
3 0 1 0 1 0
4 1 0 1 1 0
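The `level` argument of `max` used above was removed in pandas 2.0, so that line fails on current versions. A minimal sketch of the same idea for newer pandas, assuming that transposing, grouping the duplicated labels and transposing back is an acceptable substitute (the `dtype=int` argument is added here only to get 0/1 instead of booleans):

```python
import pandas as pd

shopping_list = [
    ["Apple", "Bread", "Fridge"],
    ["Rice", "Bread", "Milk"],
    ["Apple", "Rice", "Bread", "Milk"],
    ["Rice", "Milk"],
    ["Apple", "Bread", "Milk"],
]
df = pd.DataFrame(shopping_list)

# Empty prefix and separator: dummy columns are named after the items only,
# so the same item coming from different source columns gets the same name.
dummies = pd.get_dummies(df, prefix="", prefix_sep="", dtype=int)

# Collapse the duplicated column names: transpose so the names become the
# index, take the max within each name, and transpose back.
result = dummies.T.groupby(level=0).max().T
print(result)
```

Unlike the output shown above, the grouped columns come back in sorted order (Apple, Bread, Fridge, Milk, Rice), which matches the expected result in the question.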

Find non-duplicated rows of a dataframe for the same index with pandas?

Example:
| param_a | param_b | param_c
1 | 0 | 0 | 0
1 | 0 | 2 | 1
3 | 2 | 1 | 1
4 | 0 | 2 | 1
3 | 2 | 1 | 1
4 | 0 | 0 | 0
4 | 0 | 0 | 0
For the duplicated index values (1, 3, 4), I want to find those whose rows differ from each other. Take index 1 and 4, for example: each of them has rows with different values.
Output:
param_a param_b param_c
1 0 0 0
1 0 2 1
4 0 2 1
4 0 0 0
Note: it returns the unique rows for each duplicated index.
I referred to this post but could not find the answer.
IIUC: after reset_index, turn each row into a tuple and use it as the value to count per index group; then filter the df with transform('nunique') to keep only indices that have more than one distinct row, and finally call drop_duplicates:
s=df.reset_index()
yourdf=s[s.apply(tuple, 1).groupby(s['index']).transform('nunique') > 1].\
drop_duplicates().\
set_index('index')
yourdf
Out[207]:
param_a param_b param_c
index
1 0 0 0
1 0 2 1
4 0 2 1
4 0 0 0
First convert index to column and remove duplicates by DataFrame.drop_duplicates and then get all duplicates per column index by Series.duplicated with keep=False and boolean indexing:
df = df.reset_index().drop_duplicates()
print (df)
index param_a param_b param_c
0 1 0 0 0
1 1 0 2 1
2 3 2 1 1
3 4 0 2 1
5 4 0 0 0
print (df['index'].duplicated(keep=False))
0 True
1 True
2 False
3 True
5 True
Name: index, dtype: bool
df1 = df[df['index'].duplicated(keep=False)].set_index('index').rename_axis(None)
print (df1)
param_a param_b param_c
1 0 0 0
1 0 2 1
4 0 2 1
4 0 0 0
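For reference, here is a self-contained sketch of the drop_duplicates plus duplicated(keep=False) approach, with the frame from the question rebuilt by hand (the variable names `deduped` and `result` are only illustrative):

```python
import pandas as pd

# The example frame from the question, with a repeated index.
df = pd.DataFrame(
    {
        "param_a": [0, 0, 2, 0, 2, 0, 0],
        "param_b": [0, 2, 1, 2, 1, 0, 0],
        "param_c": [0, 1, 1, 1, 1, 0, 0],
    },
    index=[1, 1, 3, 4, 3, 4, 4],
)

# Step 1: move the index into a column and drop fully identical rows.
deduped = df.reset_index().drop_duplicates()

# Step 2: keep only index values that still occur more than once,
# i.e. indices that have at least two different rows.
still_repeated = deduped["index"].duplicated(keep=False)
result = deduped[still_repeated].set_index("index").rename_axis(None)
print(result)
```

Index 3 disappears because its two rows are identical, so only one of them survives step 1 and it is no longer a duplicate in step 2.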
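I tried it this way with duplicated (it also has a keep parameter that controls which duplicates are marked):
import numpy as np
import pandas as pd
df=df.reset_index()
mask = pd.DataFrame(np.sort(df[list(df)], axis=1), index=df.index).duplicated()
df1 = df[~mask]
df1=df1.set_index('index')
Printing the original df, the frame after reset_index, and df1 gives, in that order:
param_a param_b param_c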
1 0 0 0
1 0 2 1
3 2 1 1
4 0 2 1
3 2 1 1
4 0 0 0
4 0 0 0
index param_a param_b param_c
0 1 0 0 0
1 1 0 2 1
2 3 2 1 1
3 4 0 2 1
4 3 2 1 1
5 4 0 0 0
6 4 0 0 0
param_a param_b param_c
index
1 0 0 0
1 0 2 1
3 2 1 1
4 0 2 1
4 0 0 0
If you try to keep the duplicates:
mask = pd.DataFrame(np.sort(df[list(df)], axis=1), index=df.index).duplicated(keep=False)
You will end up with this result:
param_a param_b param_c
index
1 0 0 0
1 0 2 1
4 0 2 1
This is close again, but it leaves out the row:
4 0 0 0
That row is dropped because it has an exact duplicate (also at index 4), yet it should be kept, since index 4 also has a different row.
So this comes close, and it is a straightforward approach, but it does not give the expected output.

How to create dummy variables on Ordinal columns in Python

I am new to Python. I have created dummy columns for a categorical column using pandas get_dummies. How do I create dummy columns for an ordinal column (say a column Rating with values 1, 2, 3, ..., 10)?
Consider the dataframe df
df = pd.DataFrame(dict(Cats=list('abcdcba'), Ords=[3, 2, 1, 0, 1, 2, 3]))
df
Cats Ords
0 a 3
1 b 2
2 c 1
3 d 0
4 c 1
5 b 2
6 a 3
pd.get_dummies works the same on either column when you pass it a single Series.
With df.Cats:
pd.get_dummies(df.Cats)
a b c d
0 1 0 0 0
1 0 1 0 0
2 0 0 1 0
3 0 0 0 1
4 0 0 1 0
5 0 1 0 0
6 1 0 0 0
With df.Ords:
pd.get_dummies(df.Ords)
0 1 2 3
0 0 0 0 1
1 0 0 1 0
2 0 1 0 0
3 1 0 0 0
4 0 1 0 0
5 0 0 1 0
6 0 0 0 1
With both:
pd.get_dummies(df)
Ords Cats_a Cats_b Cats_c Cats_d
0 3 1 0 0 0
1 2 0 1 0 0
2 1 0 0 1 0
3 0 0 0 0 1
4 1 0 0 1 0
5 2 0 1 0 0
6 3 1 0 0 0
Notice that it split out Cats but not Ords.
Let's expand on this by adding another Cats2 column and calling pd.get_dummies:
pd.get_dummies(df.assign(Cats2=df.Cats))
Ords Cats_a Cats_b Cats_c Cats_d Cats2_a Cats2_b Cats2_c Cats2_d
0 3 1 0 0 0 1 0 0 0
1 2 0 1 0 0 0 1 0 0
2 1 0 0 1 0 0 0 1 0
3 0 0 0 0 1 0 0 0 1
4 1 0 0 1 0 0 0 1 0
5 2 0 1 0 0 0 1 0 0
6 3 1 0 0 0 1 0 0 0
Interesting, it splits both object columns but not the numeric one.
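To get dummy columns for the numeric column as well when passing the whole frame, one option is the `columns` parameter of pd.get_dummies, which names the columns to encode regardless of dtype. A minimal sketch on the same df (the `dtype=int` argument is an addition here, used only so the output is 0/1 rather than booleans on recent pandas versions):

```python
import pandas as pd

df = pd.DataFrame(dict(Cats=list('abcdcba'), Ords=[3, 2, 1, 0, 1, 2, 3]))

# Listing Ords explicitly forces it to be encoded even though it is numeric.
dummies = pd.get_dummies(df, columns=['Cats', 'Ords'], dtype=int)
print(dummies)
```

Each listed column is expanded with its own prefix, so the result has Cats_a to Cats_d followed by Ords_0 to Ords_3.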

Truth table with 5 inputs and 3 outputs

I have to make a truth table with 5 inputs and 3 outputs, something like this:
A B C D E red green blue
0 0 0 0 0 0 0 0
0 0 0 0 1 0 0 1
0 0 0 1 0 0 0 1
.
.
.
.
1 1 0 1 0 0 1 1
.
.
.
1 1 1 1 1 1 0 1
etc. (32 rows in total; the numbers in the RGB columns represent the number of 1's in each row, written in binary, i.e. the row 1 1 0 1 0 contains three 1's, and three in binary is 0 1 1).
I would like to present the result in the Atanua (http://sol.gfxile.net/atanua/index.html) tool (so, for example, when I press button E the blue light will shine, when I press A, B and D the green and blue lights will shine, and so on). But there is a requirement that I can only use AND, OR and NOT gates, and each gate can have only two inputs. Although I'm using a Karnaugh map to minimize it, with so many rows the results for each output are still very long (especially for the last one).
I tried to simplify it more by adding all of the three output boolean functions into one, and the minimization process ended pretty well:
A + B + C + D
It seems to work fine (but as there is only one output light, it works only in red green blue column separately). My concern is the fact that I would like to have three outputs (three lights, not one), and is that even possible after this kind of minimization? Is there a good solution to do it in Atanua? Or do I have to make 3 separate boolean functions, no matter how long they will be (and there is a lot of them even after minimization)?
EDIT: the whole truth table :)
A B C D E R G B
0 0 0 0 0 0 0 0
0 0 0 0 1 0 0 1
0 0 0 1 0 0 0 1
0 0 0 1 1 0 1 0
0 0 1 0 0 0 0 1
0 0 1 0 1 0 1 0
0 0 1 1 0 0 1 0
0 0 1 1 1 0 1 1
0 1 0 0 0 0 0 1
0 1 0 0 1 0 1 0
0 1 0 1 0 0 1 0
0 1 0 1 1 0 1 1
0 1 1 0 0 0 1 0
0 1 1 0 1 0 1 1
0 1 1 1 0 0 1 1
0 1 1 1 1 1 0 0
1 0 0 0 0 0 0 1
1 0 0 0 1 0 1 0
1 0 0 1 0 0 1 0
1 0 0 1 1 0 1 1
1 0 1 0 0 0 1 0
1 0 1 0 1 0 1 1
1 0 1 1 0 0 1 1
1 0 1 1 1 1 0 0
1 1 0 0 0 0 1 0
1 1 0 0 1 0 1 1
1 1 0 1 0 0 1 1
1 1 0 1 1 1 0 0
1 1 1 0 0 0 1 1
1 1 1 0 1 1 0 0
1 1 1 1 0 1 0 0
1 1 1 1 1 1 0 1
And the Karnaugh map result for each color (~ is NOT, * is AND, + is OR; adjacent variables are ANDed):
RED:
BCDE+ACDE+ABDE+ABCE+ABCD
GREEN:
~A~BDE+~AC~DE+~ACD~E+~BCD~E+~AB~CE+B~CD~E+BC~D~E+A~B~CE+A~B~CD+A~BC~D+AB~C~D
BLUE:
~A~B~C~DE+~A~B~CD~E+~A~BC~D~E+~A~BCDE+~AB~C~D~E+~AB~CDE+~ABC~DE+~ABCD~E+A~B~C~D~E+A~B~CDE+A~BC~DE+A~BCD~E+AB~C~DE+AB~CD~E+ABC~D~E+ABCDE
I have to admit that the formulas are somewhat ugly, but they are not too complicated to implement with logic gates, because you can reuse parts.
A -----+------+------------- - - -
NOT |
+------|--AND- ~AB
| | |
AND-----|---|-- ~A~B
+--AND-+ |
| +--|---|-- A~B
NOT AND--|-- AB
B -----+------+---+---------- - - -
Here, as an example, I created all combinations of [not]A and [not]B. You can do the same for C and D. You can then get any combination of [not]A, [not]B, [not]C and [not]D by combining one wire from each "box" with an AND gate (e.g. for ABCD we would take the AB wire AND the CD wire).
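Before wiring anything, it can be worth checking the RED, GREEN and BLUE expressions listed earlier against the rule that the three outputs encode the number of 1's among the inputs. The sketch below does that by brute force over all 32 input rows; the helper names `evaluate` and `check` are made up for this illustration and have nothing to do with Atanua.

```python
from itertools import product

# The three sum-of-products expressions, copied from the question.
RED = "BCDE+ACDE+ABDE+ABCE+ABCD"
GREEN = ("~A~BDE+~AC~DE+~ACD~E+~BCD~E+~AB~CE+B~CD~E+BC~D~E"
         "+A~B~CE+A~B~CD+A~BC~D+AB~C~D")
BLUE = ("~A~B~C~DE+~A~B~CD~E+~A~BC~D~E+~A~BCDE+~AB~C~D~E+~AB~CDE"
        "+~ABC~DE+~ABCD~E+A~B~C~D~E+A~B~CDE+A~BC~DE+A~BCD~E"
        "+AB~C~DE+AB~CD~E+ABC~D~E+ABCDE")


def evaluate(expr, values):
    """Evaluate a sum-of-products string: '~X' is NOT X, adjacent
    literals are ANDed, and '+' separates the ORed terms."""
    for term in expr.split("+"):
        term_true = True
        negate = False
        for ch in term:
            if ch == "~":
                negate = True
                continue
            bit = values[ch]
            if negate:
                bit = 1 - bit
            negate = False
            if bit == 0:
                term_true = False
                break
        if term_true:
            return 1
    return 0


def check():
    """Return the input rows where the formulas disagree with the count."""
    mismatches = []
    for bits in product((0, 1), repeat=5):
        values = dict(zip("ABCDE", bits))
        ones = sum(bits)
        # Expected outputs: the count of 1's as a 3-bit number (R, G, B).
        expected = ((ones >> 2) & 1, (ones >> 1) & 1, ones & 1)
        got = (evaluate(RED, values), evaluate(GREEN, values),
               evaluate(BLUE, values))
        if got != expected:
            mismatches.append((bits, expected, got))
    return mismatches


print(check())  # an empty list means all 32 rows agree
```

An empty result confirms that three separate output functions are needed, one per light: RED is on for four or more pressed buttons, GREEN for two or three, and BLUE for an odd number.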

Matplotlib pcolor not plotting correctly

I am trying to create a heat map from a DataFrame (df) of IDs (rows) and Positions (columns) at which a motif is possible. If the motif is present the value of the table is 1 and 0 if it is not present. Such as:
ID Position 1 2 3 4 5 6 7 8 9 10 ...etc
A 0 1 0 0 0 1 0 0 0 1
B 1 0 1 0 1 0 0 1 0 0
C 0 0 0 1 0 0 1 0 1 0
D 1 0 1 0 0 0 1 0 1 0
I then multiply this matrix by itself to find the number of times the motifs present co-occur with motifs at other positions using the code:
df.T.dot(df)
To obtain the Data Frame:
POS 1 2 3 4 5 6 7 8 9 10 ...
1 2 0 2 0 1 0 1 1 1 0
2 0 1 0 0 0 1 0 0 0 1
3 2 0 2 0 1 0 1 1 1 0
4 0 0 0 1 0 0 1 0 1 0
5 1 0 1 0 1 0 0 1 0 0
6 0 1 0 0 0 1 0 0 0 1
7 1 0 1 1 0 0 2 0 2 0
8 1 0 1 0 1 0 0 1 0 0
9 1 0 1 1 0 0 2 0 2 0
10 0 1 0 0 0 1 0 0 0 1
...
This is symmetric about the diagonal; however, when I try to create the heat map using
pylab.pcolor(df)
It gives me an asymmetrical map that does not seem to be representing the dotted matrix. I don't have enough reputation to post an image though.
Does anyone know why this might be occurring? Thanks
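As a starting point for debugging, the co-occurrence step can be reproduced from the four example rows above and checked for symmetry before any plotting happens. This is only a sketch: it uses the ten positions shown in the question and the names `co` and `is_symmetric` are illustrative.

```python
import pandas as pd

positions = list(range(1, 11))
df = pd.DataFrame(
    [
        [0, 1, 0, 0, 0, 1, 0, 0, 0, 1],  # A
        [1, 0, 1, 0, 1, 0, 0, 1, 0, 0],  # B
        [0, 0, 0, 1, 0, 0, 1, 0, 1, 0],  # C
        [1, 0, 1, 0, 0, 0, 1, 0, 1, 0],  # D
    ],
    index=list("ABCD"),
    columns=positions,
)

# Position-by-position co-occurrence counts, as in the question.
co = df.T.dot(df)

# The matrix should equal its own transpose.
is_symmetric = bool((co.values == co.values.T).all())
print(co)
print(is_symmetric)
```

If the matrix itself is symmetric, the plot orientation is one thing to check. `pcolor` draws the first row of the array at the bottom of the axes, so the picture is flipped vertically relative to the printed table, and a symmetric matrix then looks mirrored about the other diagonal. Calling `plt.gca().invert_yaxis()` after plotting puts row 1 at the top. Whether that is the cause here is an assumption, since the image was not posted.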