How to plot distribution of final grade by sex? - pandas

I'm working on predicting student performance based on various factors. This is a link to my data: https://archive.ics.uci.edu/ml/datasets/Student+Performance#. This is a sample of the observations from the sex and final grade (G3) columns:
sex G3
F 6
F 6
F 10
F 15
F 10
M 15
M 11
F 6
M 19
M 15
F 9
F 12
M 14
I'm looking at the distribution of my target variable (final grade):
import seaborn as sns
import matplotlib.pyplot as plt

ax = sns.kdeplot(data=df2, x="G3", shade=True)
ax.set(xlabel='Final Grade', ylabel='Density', title='Distribution of Final Grade')
plt.xlim([0, 20])
plt.show()
Screenshot of Distribution of Final Grade
And now I want to find out how the distribution of final grades differs by sex:
How can I do this?

Consider the sample data:
df2 = pd.DataFrame({'sex': ['F','F','F','F','F','M','M','F','M','M','F','F','M'], 'grades': [6,6,10,15,10,15,11,6,19,15,9,12,14]})
   sex  grades
0    F       6
1    F       6
2    F      10
3    F      15
4    F      10
5    M      15
6    M      11
7    F       6
8    M      19
9    M      15
10   F       9
11   F      12
12   M      14
We use the seaborn countplot function as follows.
sns.countplot(x="grades", hue='sex', data=df2)
To get the following plot.
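If you would rather keep the density view from the question instead of counts, seaborn's kdeplot also accepts a hue argument. A minimal sketch along those lines, assuming seaborn 0.11+ and the original df2 with a G3 column, would overlay one density curve per sex:
import seaborn as sns
import matplotlib.pyplot as plt

# One density curve per sex, overlaid on the same axes
ax = sns.kdeplot(data=df2, x="G3", hue="sex", shade=True, common_norm=False)
ax.set(xlabel='Final Grade', ylabel='Density', title='Distribution of Final Grade by Sex')
plt.xlim([0, 20])
plt.show()
common_norm=False scales each group's density independently, which is usually what you want when comparing subgroups of different sizes.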

Related

keep all column after sum and groupby including empty values

I have the following dataframe:
source name cost other_c other_b
a a 7 dd 33
b a 6 gg 44
c c 3 ee 55
b a 2
d b 21 qw 21
e a 16 aq
c c 10 55
I am summing cost grouped by source and name with:
new_df = df.groupby(['source', 'name'], as_index=False)['cost'].sum()
but it is dropping the remaining 6 columns in my dataframe. Is there a way to keep the rest of the columns? I'm not looking to add a new column, just to carry over the columns from the original dataframe.
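One common approach (a sketch, not the only option) is to aggregate the other columns alongside the sum, for example by keeping the first value seen per group; whether that is appropriate depends on your data. The frame below is a hypothetical reconstruction of the example:
import pandas as pd

# Hypothetical reconstruction of the example dataframe
df = pd.DataFrame({
    'source':  ['a', 'b', 'c', 'b', 'd', 'e', 'c'],
    'name':    ['a', 'a', 'c', 'a', 'b', 'a', 'c'],
    'cost':    [7, 6, 3, 2, 21, 16, 10],
    'other_c': ['dd', 'gg', 'ee', None, 'qw', 'aq', None],
    'other_b': [33, 44, 55, None, 21, None, 55],
})

# Sum cost per (source, name) and carry over the first non-null value of each other column
new_df = df.groupby(['source', 'name'], as_index=False).agg(
    {'cost': 'sum', 'other_c': 'first', 'other_b': 'first'}
)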

How to convert grades into points using pandas?

My code returns an error when I run it. Why might this be so?
import pandas as pd
df1 = pd.read_csv('sample.csv')
points = [0,1,2,3,4,5,6,7,8,9,10,11,12]
bins = ['X','E','D-','D','D+','C-','C','C+','B-','B','B+','A-','A']
df1['DA'] = pd.cut(df1.AA,bins,labels=points)
df1['DE'] = pd.cut(df1['BB'],bins,labels=points)
df1['CDI'] = pd.cut(df1.CC,bins,labels=points)
The error
ValueError: could not convert string to float: 'X'
EDITS
Those are student grades that I want to convert to points. Like grade A is 12 points in that order...
You can try using replace instead. First create a dict with the conversion you want to apply, then you can create your columns:
# Sample DataFrame
df = pd.DataFrame({'AA': ['X','E','D-','D','D+','C-','C','C+','B-','B','B+','A-','A']})
# conversion dict
points = [0,1,2,3,4,5,6,7,8,9,10,11,12]
grades = ['X','E','D-','D','D+','C-','C','C+','B-','B','B+','A-','A']
conversion = dict(zip(grades, points))
# applying conversion
df['DA'] = df.AA.replace(conversion)
The DataFrame will now look like:
AA DA
0 X 0
1 E 1
2 D- 2
3 D 3
4 D+ 4
5 C- 5
6 C 6
7 C+ 7
8 B- 8
9 B 9
10 B+ 10
11 A- 11
12 A 12
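Applied to the columns from the question, the same idea might look like the sketch below; it assumes sample.csv really contains AA, BB and CC columns holding those letter grades, and uses map, which behaves like replace here but leaves unmapped values as NaN:
import pandas as pd

points = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]
grades = ['X', 'E', 'D-', 'D', 'D+', 'C-', 'C', 'C+', 'B-', 'B', 'B+', 'A-', 'A']
conversion = dict(zip(grades, points))

df1 = pd.read_csv('sample.csv')
df1['DA'] = df1['AA'].map(conversion)
df1['DE'] = df1['BB'].map(conversion)
df1['CDI'] = df1['CC'].map(conversion)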

Concatenating csv files side by side in pandas [duplicate]

This question already has answers here:
How to merge two dataframes side-by-side?
(6 answers)
Closed 2 years ago.
I have 6 csv files like this:
A,5601093669
C,714840722
D,3311821086
E,3714631762
F,2359322409
G,4449445373
H,1321142307
I,3403144346
K,2941319082
L,5982421765
M,1431041943
N,2289666237
P,2944809622
Q,2266749163
R,3503618053
S,3995185703
T,3348978524
V,4184646229
W,790817778
Y,1747887712
And I would like to concatenate them side by side, i.e.:
A,5601093669,5601093669,5601093669,5601093669...
C,714840722,714840722,714840722,714840722 ...
D,3311821086,3311821086,3311821086,3311821086...
Or even make a data frame directly like this:
Letters Counts1 Counts2 Counts3 ...
0 A 949038913 949038913 949038913 ...
1 C 154135114 154135114 154135114 ...
.
.
.
I tried to use pandas, but I only managed to concatenate them one on top of the other, like this:
Letters Counts
0 A 949038913
1 C 154135114
2 D 602309784
3 E 672070230
4 F 430604264
5 G 760092523
6 H 242152981
7 I 608218717
8 K 558412515
9 L 1057894498
10 N 455966669
11 M 238551663
12 P 554657856
13 Q 423767129
14 R 650581191
15 S 819127381
16 T 632469374
17 V 717790671
18 W 144439568
19 Y 324996779
20 A 5601093669
21 C 714840722
22 D 3311821086
23 E 3714631762
24 F 2359322409
25 G 4449445373
26 H 1321142307
27 I 3403144346
28 K 2941319082
29 L 5982421765
30 M 1431041943
31 N 2289666237
The code was like this:
import glob
import pandas as pd

extension = 'csv'
all_filenames = [i for i in glob.glob('*.{}'.format(extension))]
df_from_each_file = [pd.read_csv(f, header=None, names=['Letters', 'Counts'])
                     for f in all_filenames]
frame = pd.concat(df_from_each_file, axis=0, ignore_index=True)
Any tip or improvement would be very welcome!
Thank you for your time!
Paulo
Use axis=1 in the pd.concat function.
You can refer to https://pandas.pydata.org/pandas-docs/stable/user_guide/merging.html
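A sketch of how that could look with the code from the question, assuming each of the 6 files lists the same letters in its first column so the rows align on that index:
import glob
import pandas as pd

all_filenames = glob.glob('*.csv')

# Read each file with the letter column as the index and give each count column its own name
df_from_each_file = [
    pd.read_csv(f, header=None, index_col=0, names=['Letters', 'Counts{}'.format(i + 1)])
    for i, f in enumerate(all_filenames)
]

# axis=1 concatenates side by side, aligning rows on the shared Letters index
frame = pd.concat(df_from_each_file, axis=1).reset_index()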

How can I run a vlookup function in SQL within the same table?

I'm fairly new to SQL and struggling to find a good way to run the following query.
I have a table that looks something like this:
NAME JOB GRADE MANAGER NAME
X 7 O
Y 6 X
Z 5 X
A 4 Z
B 3 Z
C 2 Z
In this table, it shows that Y and Z report into X, and A, B and C report into Z.
I want to create a computed column showing the grade of each person's most senior direct report, or "n/a" if they don't manage anyone. So that would look something like this:
NAME JOB GRADE MANAGER NAME GRADE OF MOST SENIOR REPORT
X 7 O 6
Y 6 X N/A
Z 5 X 4
A 4 Z N/A
B 3 Z N/A
C 2 Z N/A
How would I do this?
SELECT g.*,
       ISNULL(CONVERT(nvarchar, (SELECT MAX(g2.GRADE)
                                 FROM dbo.Grade g2
                                 WHERE g2.manager = g.NAME AND g2.NAME != g.NAME)), 'N/A') AS most_graded
FROM dbo.Grade g
MAX picks out the highest grade among each person's direct reports.
Input
X 7 O
y 6 X
Z 5 X
A 6 Z
C 2 Z
Output
X 7 O 6
y 6 X N/A
Z 5 X 6
A 6 Z N/A
C 2 Z N/A
Something like this:
select name, job_grade, manager_name,
(select max(job_grade) from grades g2
where g2.manager_name = g1.name) as grade_of_most_recent_senior
from grades g1
order by name;
The above is ANSI SQL and should work on any DBMS.
SQLFiddle example: http://sqlfiddle.com/#!15/e0806/1

How to subtract one dataframe from another?

First, let me set the stage.
I start with a pandas dataframe klmn, that looks like this:
In [15]: klmn
Out[15]:
K L M N
0 0 a -1.374201 35
1 0 b 1.415697 29
2 0 a 0.233841 18
3 0 b 1.550599 30
4 0 a -0.178370 63
5 0 b -1.235956 42
6 0 a 0.088046 2
7 0 b 0.074238 84
8 1 a 0.469924 44
9 1 b 1.231064 68
10 2 a -0.979462 73
11 2 b 0.322454 97
Next I split klmn into two dataframes, klmn0 and klmn1, according to the value in the 'K' column:
In [16]: k0 = klmn.groupby(klmn['K'] == 0)
In [17]: klmn0, klmn1 = [klmn.iloc[k0.indices[tf]] for tf in (True, False)]
In [18]: klmn0, klmn1
Out[18]:
( K L M N
0 0 a -1.374201 35
1 0 b 1.415697 29
2 0 a 0.233841 18
3 0 b 1.550599 30
4 0 a -0.178370 63
5 0 b -1.235956 42
6 0 a 0.088046 2
7 0 b 0.074238 84,
K L M N
8 1 a 0.469924 44
9 1 b 1.231064 68
10 2 a -0.979462 73
11 2 b 0.322454 97)
Finally, I compute the mean of the M column in klmn0, grouped by the value in the L column:
In [19]: m0 = klmn0.groupby('L')['M'].mean(); m0
Out[19]:
L
a -0.307671
b 0.451144
Name: M
Now, my question is, how can I subtract m0 from the M column of the klmn1 sub-dataframe, respecting the value in the L column? (By this I mean that m0['a'] gets subtracted from the M column of each row in klmn1 that has 'a' in the L column, and likewise for m0['b'].)
One could imagine doing this in a way that replaces the values in the M column of klmn1 with the new values (after subtracting the value from m0). Alternatively, one could imagine doing this in a way that leaves klmn1 unchanged, and instead produces a new dataframe klmn11 with an updated M column. I'm interested in both approaches.
If you set the index of your klmn1 dataframe to the L column, then the subtraction will automatically align on that index with any series you subtract from it:
In [1]: klmn1.set_index('L')['M'] - m0
Out[1]:
L
a 0.777595
a -0.671791
b 0.779920
b -0.128690
Name: M
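To get a full dataframe back rather than just the adjusted series, one sketch (assuming klmn1 and m0 as defined above) is to map m0 onto the L column and subtract, which leaves klmn1 itself untouched:
# Build klmn11 with an adjusted M column, without modifying klmn1
klmn11 = klmn1.copy()
klmn11['M'] = klmn1['M'] - klmn1['L'].map(m0)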
Option #1:
df1.subtract(df2, fill_value=0)
Option #2:
df1.subtract(df2, fill_value=None)