Concatenating csv files side by side in pandas [duplicate] - pandas

I have 6 csv files like this:
A,5601093669
C,714840722
D,3311821086
E,3714631762
F,2359322409
G,4449445373
H,1321142307
I,3403144346
K,2941319082
L,5982421765
M,1431041943
N,2289666237
P,2944809622
Q,2266749163
R,3503618053
S,3995185703
T,3348978524
V,4184646229
W,790817778
Y,1747887712
And I would like to concatenate them side by side, i.e.:
A,5601093669,5601093669,5601093669,5601093669...
C,714840722,714840722,714840722,714840722 ...
D,3311821086,3311821086,3311821086,3311821086...
Or even make a data frame directly like this:
Letters Counts1 Counts2 Counts3 ...
0 A 949038913 949038913 949038913 ...
1 C 154135114 154135114 154135114 ...
.
.
.
I tried to use pandas, but I could only concatenate them one on top of the other, like this:
Letters Counts
0 A 949038913
1 C 154135114
2 D 602309784
3 E 672070230
4 F 430604264
5 G 760092523
6 H 242152981
7 I 608218717
8 K 558412515
9 L 1057894498
10 N 455966669
11 M 238551663
12 P 554657856
13 Q 423767129
14 R 650581191
15 S 819127381
16 T 632469374
17 V 717790671
18 W 144439568
19 Y 324996779
20 A 5601093669
21 C 714840722
22 D 3311821086
23 E 3714631762
24 F 2359322409
25 G 4449445373
26 H 1321142307
27 I 3403144346
28 K 2941319082
29 L 5982421765
30 M 1431041943
31 N 2289666237
The code was like this:
import glob
import pandas as pd
extension = 'csv'
all_filenames = glob.glob('*.{}'.format(extension))
df_from_each_file = [pd.read_csv(f, header=None, names=['Letters', 'Counts'])
                     for f in all_filenames]
frame = pd.concat(df_from_each_file, axis=0, ignore_index=True)
Any tip or improvement would be very welcome!
Thank you for your time!
Paulo

Use axis=1 in the pd.concat function.
For details, refer to https://pandas.pydata.org/pandas-docs/stable/user_guide/merging.html
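As a minimal sketch of the axis=1 suggestion: the two in-memory strings below are hypothetical stand-ins for the CSV files on disk (their values are copied from the question), and the Counts1, Counts2 column names are an assumption based on the desired output. Setting the letter column as the index lets pd.concat line the rows up by letter rather than by position.

```python
import io

import pandas as pd

# Hypothetical stand-ins for the CSV files; in practice these would be
# the paths returned by glob.glob('*.csv').
csv_texts = [
    "A,5601093669\nC,714840722\nD,3311821086\n",
    "A,949038913\nC,154135114\nD,602309784\n",
]

# Read each file with the letter as the index and a distinct counts column.
frames = [
    pd.read_csv(
        io.StringIO(text),
        header=None,
        names=["Letters", f"Counts{i}"],
        index_col="Letters",
    )
    for i, text in enumerate(csv_texts, start=1)
]

# axis=1 places the frames side by side, aligned on the shared index.
frame = pd.concat(frames, axis=1).rename_axis("Letters").reset_index()
print(frame)
```

If a letter is missing from one file, its row is still kept and the missing count shows up as NaN.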

Related

the 'combine' of a split-apply-combine in pd.groupby() works brilliantly, but I'm not sure why

I have a fragment of code similar to below. It works perfectly, but I'm not sure why I am so lucky.
The groupby() is a split-apply-combine operation. So I understand why the qf.groupby(qf.g).mean() returns a series with two rows, the mean() for each of a,b.
And what's brilliant is that the combine step of qf.groupby(qf.g).cumsum() reassembles all the rows into their original order, as found in the starting df.
My question is, "Why am I able to count on this behavior?" I'm glad I can, but I cannot articulate why it's possible.
#split-apply-combine
import pandas as pd
#DF with a value, and an arbitrary category
qf= pd.DataFrame(data=[x for x in "aaabbaaaab"], columns=['g'])
qf['val'] = [1,2,3,1,2,3,4,5,6,9]
print(f"applying mean() to members in each group of a,b ")
print ( qf.groupby(qf.g).mean() )
print(f"\n\napplying cumsum() to members in each group of a,b ")
print( qf.groupby(qf.g).cumsum() ) #this combines them in the original index order thankfully
qf['running_totals'] = qf.groupby(qf.g).cumsum()
print (f"\n{qf}")
yields:
applying mean() to members in each group of a,b
val
g
a 3.428571
b 4.000000
applying cumsum() to members in each group of a,b
val
0 1
1 3
2 6
3 1
4 3
5 9
6 13
7 18
8 24
9 12
g val running_totals
0 a 1 1
1 a 2 3
2 a 3 6
3 b 1 1
4 b 2 3
5 a 3 9
6 a 4 13
7 a 5 18
8 a 6 24
9 b 9 12
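One way to frame the behavior, as a sketch rather than a full explanation: the pandas groupby documentation distinguishes aggregations such as mean(), which return one row per group, from transformations such as cumsum(), which return a result indexed like the original object. The snippet below uses the same data as the question and selects the val column explicitly, which is a small change from the original code.

```python
import pandas as pd

qf = pd.DataFrame({"g": list("aaabbaaaab"),
                   "val": [1, 2, 3, 1, 2, 3, 4, 5, 6, 9]})

# Aggregation: one row per group, indexed by the group labels.
means = qf.groupby("g")["val"].mean()

# Transformation: one row per input row, carrying the original index.
running = qf.groupby("g")["val"].cumsum()

# Because the result keeps the original index, assigning it back as a
# column lines up row by row.
qf["running_totals"] = running
print(means)
print(qf)
```

The assignment relies on index alignment, so it would still work if the result came back in a different row order.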

How to plot distribution of final grade by sex?

I'm working on predicting student performance based on various factors. This is a link to my data: https://archive.ics.uci.edu/ml/datasets/Student+Performance#. This is a sample of the observations from the sex and final grade data columns:
sex G3
F 6
F 6
F 10
F 15
F 10
M 15
M 11
F 6
M 19
M 15
F 9
F 12
M 14
I'm looking at the distribution of my target variable (final grade):
ax= sns.kdeplot(data=df2, x="G3", shade=True)
ax.set(xlabel= 'Final Grade', ylabel= 'Density', title= 'Distribution of Final Grade')
plt.xlim([0, 20])
plt.show()
[Screenshot: Distribution of Final Grade]
And now I want to find out how the distribution of final grades differs by sex:
How can I do this?
Consider the sample data:
df2 = pd.DataFrame({'sex': ['F','F','F','F','F','M','M','F','M','M','F','F','M'], 'grades': [6,6,10,15,10,15,11,6,19,15,9,12,14]})
sex grades
F 6
F 6
F 10
F 15
F 10
M 15
M 11
F 6
M 19
M 15
F 9
F 12
M 14
We can use the seaborn countplot function as follows:
sns.countplot(x="grades", hue='sex', data=df2)
This produces a grouped bar chart with one bar per grade and sex.
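To check the numbers behind such a plot without drawing it, the counts can be tabulated directly. This is only a sketch on the sample data above; pd.crosstab and describe() are used here for illustration and are not part of the original answer.

```python
import pandas as pd

df2 = pd.DataFrame({
    "sex": ['F', 'F', 'F', 'F', 'F', 'M', 'M', 'F', 'M', 'M', 'F', 'F', 'M'],
    "grades": [6, 6, 10, 15, 10, 15, 11, 6, 19, 15, 9, 12, 14],
})

# Rows are grades, columns are sexes, cells are the bar heights
# that a count plot with hue='sex' would show.
counts = pd.crosstab(df2["grades"], df2["sex"])
print(counts)

# Per-sex summary statistics of the final grade.
summary = df2.groupby("sex")["grades"].describe()
print(summary)
```

If a density curve like the one in the question is preferred over bars, seaborn's kdeplot also accepts a hue argument, for example sns.kdeplot(data=df2, x="grades", hue="sex").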

How to convert grades into points using pandas?

My code returns an error when I run it. Why might this be so?
import pandas as pd
df1 = pd.read_csv('sample.csv')
points = [0,1,2,3,4,5,6,7,8,9,10,11,12]
bins = ['X','E','D-','D','D+','C-','C','C+','B-','B','B+','A-','A']
df1['DA'] = pd.cut(df1.AA,bins,labels=points)
df1['DE'] = pd.cut(df1['BB'],bins,labels=points)
df1['CDI'] = pd.cut(df1.CC,bins,labels=points)
The error
ValueError: could not convert string to float: 'X'
EDITS
These are student grades that I want to convert to points. For example, grade A is 12 points, with the other grades following in that order...
You can try using replace instead. First create a dict with the conversion you want to apply, then use it to create your columns:
import pandas as pd
# Sample DataFrame
df = pd.DataFrame({'AA': ['X','E','D-','D','D+','C-','C','C+','B-','B','B+','A-','A']})
# conversion dict
points = [0,1,2,3,4,5,6,7,8,9,10,11,12]
grades = ['X','E','D-','D','D+','C-','C','C+','B-','B','B+','A-','A']
conversion = dict(zip(grades, points))
# applying conversion
df['DA'] = df.AA.replace(conversion)
The DataFrame will now look like:
AA DA
0 X 0
1 E 1
2 D- 2
3 D 3
4 D+ 4
5 C- 5
6 C 6
7 C+ 7
8 B- 8
9 B 9
10 B+ 10
11 A- 11
12 A 12
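The same dictionary can be applied to all three columns from the question. The sketch below assumes the column names AA, BB and CC and the target names DA, DE and CDI from the question's code; the small DataFrame is a hypothetical stand-in for sample.csv. It uses Series.map instead of replace, a swapped-in alternative that turns any grade missing from the dictionary into NaN. The original error arises because pd.cut expects numeric bin edges, not grade labels.

```python
import pandas as pd

grades = ['X', 'E', 'D-', 'D', 'D+', 'C-', 'C', 'C+', 'B-', 'B', 'B+', 'A-', 'A']
# Position in the list is the number of points: 'X' -> 0, ..., 'A' -> 12.
conversion = {grade: point for point, grade in enumerate(grades)}

# Hypothetical stand-in for pd.read_csv('sample.csv').
df1 = pd.DataFrame({
    "AA": ["A", "B+", "X"],
    "BB": ["C-", "D", "A-"],
    "CC": ["E", "B", "C+"],
})

# Map each grade column to its points column.
for source, target in {"AA": "DA", "BB": "DE", "CC": "CDI"}.items():
    df1[target] = df1[source].map(conversion)

print(df1)
```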

Efficiently assigning values to multidimensional array based on indices list on one dimension

I have a matrix M of size [S1, S2, S3].
I have another matrix K that serves as the indices in the first dimension that I want to assign, with size [1, S2, S3].
And V is a [1, S2, S3] matrix which contains the values to be assigned correspondingly.
With for loops, this is how I did it:
for x2 = 1:S2
  for x3 = 1:S3
    M(K(1,x2,x3), x2, x3) = V(1, x2, x3)
  endfor % x3
endfor % x2
Is there a more efficient way to do this?
Visualization for 2D case:
M =
1 4 7 10
2 5 8 11
3 6 9 12
K =
2 1 3 2
V =
50 80 70 60
Desired =
1 80 7 10
50 5 8 60
3 6 70 12
Test case:
M = reshape(1:24, [3,4,2])
K = reshape([2,1,3,2,3,3,1,2], [1,4,2])
V = reshape(10:10:80, [1,4,2])
s = size(M)
M = assign_values(M, K, V)
M =
ans(:,:,1) =
1 20 7 10
10 5 8 40
3 6 30 12
ans(:,:,2) =
13 16 70 22
14 17 20 80
50 60 21 24
I'm looking for an efficient way to implement assign_values there.
Running Gelliant's answer somehow gives me this:
key = sub2ind(s, K, [1:s(2)])
error: sub2ind: all subscripts must be of the same size
You can use sub2ind to convert your individual subscripts to linear indices. These can then be used to replace the corresponding elements of M with the values in V.
M = [1 4 7 10 ;...
2 5 8 11 ;...
3 6 9 12];
s=size(M);
K = [2 1 3 2];
K = sub2ind(s,K,[1:s(2)])
V = [50 80 70 60];
M(K)=V;
You don't need reshape and M=M(:) for it to work in Matlab.
I found that this works:
K = K(:)'+(S1*(0:numel(K)-1));
M(K) = V;
Perhaps this is supposed to work the same way as Gelliant's answer, but I couldn't make his answer work, somehow =/
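For comparison only, here is a NumPy analogue of the same assignment on the question's 3D test case. It is a sketch, not a translation of either answer: it assumes column-major (order='F') reshaping to mimic Octave, and shifts K to zero-based indices. np.put_along_axis writes V into M at the first-axis positions given by K.

```python
import numpy as np

# Rebuild the test case with Octave-style column-major layout.
M = np.arange(1, 25).reshape((3, 4, 2), order="F")
K = np.array([2, 1, 3, 2, 3, 3, 1, 2]).reshape((1, 4, 2), order="F") - 1  # zero-based
V = np.arange(10, 90, 10).reshape((1, 4, 2), order="F")

# For every (x2, x3), set M[K[0, x2, x3], x2, x3] = V[0, x2, x3].
np.put_along_axis(M, K, V, axis=0)

print(M[:, :, 0])
print(M[:, :, 1])
```

The two printed slices correspond to ans(:,:,1) and ans(:,:,2) in the expected output above.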

How to subtract one dataframe from another?

First, let me set the stage.
I start with a pandas dataframe klmn, that looks like this:
In [15]: klmn
Out[15]:
K L M N
0 0 a -1.374201 35
1 0 b 1.415697 29
2 0 a 0.233841 18
3 0 b 1.550599 30
4 0 a -0.178370 63
5 0 b -1.235956 42
6 0 a 0.088046 2
7 0 b 0.074238 84
8 1 a 0.469924 44
9 1 b 1.231064 68
10 2 a -0.979462 73
11 2 b 0.322454 97
Next I split klmn into two dataframes, klmn0 and klmn1, according to the value in the 'K' column:
In [16]: k0 = klmn.groupby(klmn['K'] == 0)
In [17]: klmn0, klmn1 = [klmn.iloc[k0.indices[tf]] for tf in (True, False)]
In [18]: klmn0, klmn1
Out[18]:
( K L M N
0 0 a -1.374201 35
1 0 b 1.415697 29
2 0 a 0.233841 18
3 0 b 1.550599 30
4 0 a -0.178370 63
5 0 b -1.235956 42
6 0 a 0.088046 2
7 0 b 0.074238 84,
K L M N
8 1 a 0.469924 44
9 1 b 1.231064 68
10 2 a -0.979462 73
11 2 b 0.322454 97)
Finally, I compute the mean of the M column in klmn0, grouped by the value in the L column:
In [19]: m0 = klmn0.groupby('L')['M'].mean(); m0
Out[19]:
L
a -0.307671
b 0.451144
Name: M
Now, my question is, how can I subtract m0 from the M column of the klmn1 sub-dataframe, respecting the value in the L column? (By this I mean that m0['a'] gets subtracted from the M column of each row in klmn1 that has 'a' in the L column, and likewise for m0['b'].)
One could imagine doing this in a way that replaces the values in the M column of klmn1 with the new values (after subtracting the value from m0). Alternatively, one could imagine doing this in a way that leaves klmn1 unchanged, and instead produces a new dataframe klmn11 with an updated M column. I'm interested in both approaches.
If you set the index of your klmn1 dataframe to the column L, pandas will automatically align the indices with any series you subtract from it:
In [1]: klmn1.set_index('L')['M'] - m0
Out[1]:
L
a 0.777595
a -0.671791
b 0.779920
b -0.128690
Name: M
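Both approaches from the question can also be sketched without changing the index, by mapping each row's L value to the matching mean in m0. This is an alternative to the set_index answer rather than a restatement of it; the data is copied from the question, and boolean masks are used to split klmn, which is an assumption that simplifies the groupby-based split shown earlier.

```python
import pandas as pd

klmn = pd.DataFrame({
    "K": [0] * 8 + [1, 1, 2, 2],
    "L": ["a", "b"] * 6,
    "M": [-1.374201, 1.415697, 0.233841, 1.550599, -0.178370, -1.235956,
          0.088046, 0.074238, 0.469924, 1.231064, -0.979462, 0.322454],
    "N": [35, 29, 18, 30, 63, 42, 2, 84, 44, 68, 73, 97],
})

klmn0 = klmn[klmn["K"] == 0]
klmn1 = klmn[klmn["K"] != 0].copy()  # copy so the in-place update is safe

# Mean of M per value of L, computed on klmn0 only.
m0 = klmn0.groupby("L")["M"].mean()

# Approach 1: leave klmn1 unchanged and build a new frame klmn11.
klmn11 = klmn1.assign(M=klmn1["M"] - klmn1["L"].map(m0))

# Approach 2: overwrite the M column of klmn1 in place.
klmn1["M"] -= klmn1["L"].map(m0)

print(klmn11)
```

Unlike the set_index version, the rows keep their original index (8 to 11) and order.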
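Option #1 (a value missing from one of the dataframes is treated as 0):
df1.subtract(df2, fill_value=0)
Option #2 (the default; a value missing from either dataframe gives NaN):
df1.subtract(df2, fill_value=None)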