How to use crosstab for multiple columns? - pandas

I need help using crosstab on the df below.
+------+------+------+
|  a   |  b   |  c   |
+------+------+------+
| a    | None | c    |
| a    | b    | None |
| None | b    | c    |
| a    | None | None |
| None | None | None |
+------+------+------+
I want to pull rows where more than one letter is specified (a&b, a&c, b&c), i.e. rows 1-3. I believe the easiest way to do this is through crosstab (I know I'll get a count, but can I also view the rows through this method?). I want to avoid having to write a lengthy 'or' statement to achieve this.
Desired Output:
+------+------+------+
|  a   |  b   |  c   |
+------+------+------+
| a    | None | c    |
| a    | b    | None |
| None | b    | c    |
+------+------+------+

You aren't looking for crosstab; just check the number of non-null values per row using notnull:
df[df.notnull().sum(1).gt(1)]
     a    b    c
0    a  NaN    c
1    a    b  NaN
2  NaN    b    c
Or you can use dropna:
t = 2
df.dropna(thresh=df.shape[1] - t + 1)
     a    b    c
0    a  NaN    c
1    a    b  NaN
2  NaN    b    c
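For a self-contained run of either approach, here is a minimal sketch that rebuilds the sample frame from the question (using None for the missing letters):
import pandas as pd

# Rebuild the sample frame from the question; None marks a missing letter.
df = pd.DataFrame({
    'a': ['a', 'a', None, 'a', None],
    'b': [None, 'b', 'b', None, None],
    'c': ['c', None, 'c', None, None],
})

# Keep rows with more than one non-null value.
out = df[df.notnull().sum(axis=1).gt(1)]
print(out)
#       a     b     c
# 0     a  None     c
# 1     a     b  None
# 2  None     b     c

# Equivalent: df.dropna(thresh=2) keeps rows with at least 2 non-null values.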

Related

How to use PostgreSQL column value with some calculations as second column value

I am inserting a huge data set into the SQL table and I want to achieve something like this.
SELECT generate_series(1,10) AS a, mod(generate_series(11,20),3) as b,mod(generate_series(21,30),5) as c;
a | b | c
----+---+---
1 | 2 | 1
2 | 0 | 2
3 | 1 | 3
4 | 2 | 4
5 | 0 | 0
6 | 1 | 1
7 | 2 | 2
8 | 0 | 3
9 | 1 | 4
10 | 2 | 0
(10 rows)
The problem is that I don't want to call the generate_series function separately for b and c. I want to take the value of a and apply mod to it for b and c, as below (or in an even more efficient way), but I can't work out how to do this efficiently, since I will be generating and saving 1 million records.
SELECT generate_series(1,10) AS a, mod(a,3) as b,mod(a,5) as c;
Use a FROM clause:
SELECT a, mod(a, 3) AS b, mod(a, 5) AS c
FROM generate_series(1, 10) gs(a);

Find which values are in every group in pandas

Is there a way of aggregating or transforming in pandas that would give me the list of values that are present in each group?
For example, taking this data
+---------+-----------+
| user_id | module_id |
+---------+-----------+
| 1 | A |
| 1 | B |
| 1 | C |
| 2 | A |
| 2 | B |
| 2 | D |
| 3 | B |
| 3 | C |
| 3 | D |
| 3 | E |
+---------+-----------+
how would I complete this code
df.groupby('user_id')
to give the result B, the only module_id that is in each of the groups?
Use get_dummies with max to build an indicator DataFrame, then keep only the all-1 columns - the 1 values are treated like True by DataFrame.all:
cols = (pd.get_dummies(df.set_index('user_id')['module_id'])
          .max(level=0)
          .loc[:, lambda x: x.all()]
          .columns)
print (cols)
Index(['B'], dtype='object')
Similar solution:
df1 = pd.get_dummies(df.set_index('user_id')['module_id']).max(level=0)
print (df1)
         A  B  C  D  E
user_id
1        1  1  1  0  0
2        1  1  0  1  0
3        0  1  1  1  1
cols = df1.columns[df1.all()]
More solutions:
cols = df.groupby(['module_id', 'user_id']).size().unstack().dropna().index
print (cols)
Index(['B'], dtype='object', name='module_id')
cols = df.pivot_table(index='module_id', columns='user_id', aggfunc='size').dropna().index
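Another way to literally complete the df.groupby('user_id') from the question is to collapse each group to a set and intersect the sets (a minimal sketch, rebuilding the sample table):
import pandas as pd

# Rebuild the sample table from the question.
df = pd.DataFrame({
    'user_id':   [1, 1, 1, 2, 2, 2, 3, 3, 3, 3],
    'module_id': ['A', 'B', 'C', 'A', 'B', 'D', 'B', 'C', 'D', 'E'],
})

# One set of module_ids per user, then intersect all the sets.
common = set.intersection(*df.groupby('user_id')['module_id'].apply(set))
print(common)  # {'B'}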

pandas pivot onto values

Given a dataframe
df=pd.DataFrame([[1,11,0],[1,12,1],[2,21,0],[2,22,1]])
df.columns = ['Key','Value','PivotOn']
pivoted = df.pivot(index='Key',columns='PivotOn',values='Value')
The pivot action will give me columns 0 and 1 from the column 'PivotOn'. But I would like to always pivot onto values 0, 1 and 2, even if there might not exist a row that has PivotOn = 2 (just produce nan for it).
I cannot modify original dataframe so I'd want something like:
pivoted = df.pivot(index='Key',columns=[0,1,2],values='Value')
where it will always produce 3 columns of 0, 1 and 2 and column 2 is filled with nans.
Assume PivotOn has three unique values 0, 1, 2
df=pd.DataFrame([[1,11,0],[1,12,1],[2,21,0],[2,22,2]])
df.columns = ['Key','Value','PivotOn']
df
+---+-----+-------+---------+
| | Key | Value | PivotOn |
+---+-----+-------+---------+
| 0 | 1 | 11 | 0 |
| 1 | 1 | 12 | 1 |
| 2 | 2 | 21 | 0 |
| 3 | 2 | 22 | 2 |
+---+-----+-------+---------+
And say you need to include columns 2, 3 and 4 (you can also assume that 2 may or may not be present in the original df, so this generalizes).
Then do:
import numpy as np

expected = {2, 3, 4}
res = list(expected - set(df.PivotOn.unique()))
if res:  # some expected PivotOn values are missing - add dummy rows for them
    new_df = pd.DataFrame({'Key': np.nan, 'Value': np.nan, 'PivotOn': res},
                          index=range(df.shape[0], df.shape[0] + len(res)))
    ndf = pd.concat([df, new_df], sort=False)
    pivoted = ndf.pivot(index='Key', columns='PivotOn', values='Value').dropna(how='all')
else:
    pivoted = df.pivot(index='Key', columns='PivotOn', values='Value')
pivoted
+---------+------+------+------+-----+-----+
| PivotOn | 0 | 1 | 2 | 3 | 4 |
+---------+------+------+------+-----+-----+
| Key | | | | | |
| 1.0 | 11.0 | 12.0 | NaN | NaN | NaN |
| 2.0 | 21.0 | NaN | 22.0 | NaN | NaN |
+---------+------+------+------+-----+-----+
You might try this if all you need is a column 2 filled with NaNs when it does not exist in your dataframe:
def no_col_2(df):
    pivoted = df.pivot(index='Key', columns='PivotOn', values='Value')
    if 2 not in df['PivotOn'].values:  # membership check on the values, not the index
        pivoted[2] = np.nan
    return pivoted

pivoted = no_col_2(df)
print(pivoted)
PivotOn   0   1    2
Key
1        11  12  NaN
2        21  22  NaN
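If the only requirement is to guarantee a fixed set of output columns, a reindex after the pivot also works and avoids appending dummy rows (a minimal sketch; the target column list [0, 1, 2] comes from the question):
import pandas as pd

df = pd.DataFrame([[1, 11, 0], [1, 12, 1], [2, 21, 0], [2, 22, 1]],
                  columns=['Key', 'Value', 'PivotOn'])

# Pivot as usual, then force the column set; missing columns come out as NaN.
pivoted = (df.pivot(index='Key', columns='PivotOn', values='Value')
             .reindex(columns=[0, 1, 2]))
print(pivoted)
# PivotOn   0   1   2
# Key
# 1        11  12 NaN
# 2        21  22 NaN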

Lexicographical sorting of a Postgres table column based on values of another column

I have a table, say initial_freq, in a PostgreSQL database (version 10.4):
initial | freq
---------+------
r | 20
s | 20
a | 10
m | 10
p | 7
k | 6
d | 5
n | 3
g | 3
c | 3
v | 3
b | 3
h | 2
y | 2
j | 2
i | 1
The requirement is that whenever there is a tie in the freq column,
the corresponding values in the initial column must be sorted
alphabetically.
The required output looks like this:
initial | freq
---------+------
r | 20
s | 20
a | 10
m | 10
p | 7
k | 6
d | 5
b | 3
c | 3
g | 3
n | 3
v | 3
h | 2
j | 2
y | 2
i | 1
This is a part of a large problem, most of which I have solved except this one.
I realize that this might be a dynamic programming problem, and I can solve it in other programming languages.
I am a complete novice in the SQL world. Any help will be much
appreciated.
Use ORDER BY to order by freq DESC and then by initial.
SELECT *
FROM your_table
ORDER BY freq DESC, initial;

How to match variable data in SQL Server

I need to map a many-to-many relationship between two flat tables. Table A contains a list of possible configurations (where each column is a question and the cell value is the answer). NULL values denote that the answer does not matter. Table B contains actual configurations with the same columns.
Ultimately, I need the final results to show which configurations are mapped between table B and A:
Example
ActualId | ConfigId
---------+---------
5 | 1
6 | 1
8 | 2
. | .
. | .
. | .
N | M
To give a simple example of the tables and data I'm working with, the first table would look like this:
Table A
--------
ConfigId | Size | Color | Cylinders | ... | ColumnN
---------+------+-------+-----------+-----+--------
1 | 3 | | 4 | ... | 5
2 | 4 | 5 | 5 | ... | 5
3 | | 5 | | ... | 5
And Table B would look like this:
Table B
-------
ActualId | Size | Color | Cylinders | ... | ColumnN
---------+------+-------+-----------+-----+--------
1 | 3 | 1 | 4 | ... | 5
2 | 3 | 8 | 4 | ... | 5
3 | 4 | 5 | 5 | ... | 5
4 | 7 | 5 | 6 | ... | 5
Since the NULL values denote that any value can work, the expected result would be:
Expected
---------
ActualId | ConfigId
---------+---------
1 | 1
2 | 1
3 | 2
3 | 3
4 | 3
I'm trying to figure out the best way to go about matching the actual data which has over a hundred columns. I know trying to check each and every column for NULL values is absolutely wrong and will not perform well. I'm really fascinated with this problem and would love some help to find the best way to tackle this.
So, this joins table A to table B on Size, Color and Cylinders.
The Size match compares A against B:
If A.SIZE is null, the comparison becomes B.SIZE = B.SIZE, which always returns true.
If A.SIZE is not null, the comparison is A.SIZE = B.SIZE, which is only true when they match.
The matching on Color and Cylinders is similar.
SELECT * FROM TABLEA A
INNER JOIN TABLEB B ON ISNULL(A.SIZE, B.SIZE)=B.SIZE
AND ISNULL(A.COLOR, B.COLOR)=B.COLOR
AND ISNULL(A.CYLINDERS, B.CYLINDERS)=B.CYLINDERS
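To sanity-check the NULL-as-wildcard rule on a small sample, the same matching can be prototyped in pandas with a cross join plus a mask (a sketch with hypothetical miniature tables mirroring the Size/Color/Cylinders example; assumes pandas >= 1.2 for how='cross'):
import numpy as np
import pandas as pd

# Hypothetical miniature versions of Table A (configs, NaN = "don't care")
# and Table B (actuals), mirroring the example in the question.
config = pd.DataFrame({
    'ConfigId':  [1, 2, 3],
    'Size':      [3, 4, np.nan],
    'Color':     [np.nan, 5, 5],
    'Cylinders': [4, 5, np.nan],
})
actual = pd.DataFrame({
    'ActualId':  [1, 2, 3, 4],
    'Size':      [3, 3, 4, 7],
    'Color':     [1, 8, 5, 5],
    'Cylinders': [4, 4, 5, 6],
})

# Cross join every actual row with every config row, then keep pairs where
# each config value is either NaN ("don't care") or equal to the actual value.
pairs = actual.merge(config, how='cross', suffixes=('_b', '_a'))
cols = ['Size', 'Color', 'Cylinders']
mask = np.logical_and.reduce(
    [pairs[c + '_a'].isna() | (pairs[c + '_a'] == pairs[c + '_b']) for c in cols]
)
print(pairs.loc[mask, ['ActualId', 'ConfigId']])
# -> pairs (1, 1), (2, 1), (3, 2), (3, 3), (4, 3), matching the expected output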