How to remove duplicated column in a dataframe - pandas

I need some help with my Data Wrangling in Python, using pandas.
I am appending dataframes in a loop (one or two rows per iteration). For an unknown reason I get duplicated columns when appending, which occurs only in a few iterations, resulting in the following dataframe:
ID    Nom. value         0         1         2         3  ...
1001        -857  -856.989  -856.989  -857.042  -857.042  ...
1001         335   334.987   334.987   335.006   335.006  ...
...
Is there a way to remove these duplicated columns systematically in the loop (the dataframe is large and built over many iterations)? I need this dataframe instead:
ID    Nom. value         0         1  ...
1001        -857  -856.989  -857.042  ...
1001         335   334.987   335.006  ...
...
Thanks!
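A minimal sketch of one common approach, assuming the duplicated columns are exact value copies: transpose so duplicated columns become duplicated rows, drop those, and transpose back (the toy frame below just reproduces the example above):

import pandas as pd

# toy frame reproducing the pattern above: adjacent columns carry identical values
df = pd.DataFrame({'ID': [1001, 1001],
                   'Nom. value': [-857, 335],
                   0: [-856.989, 334.987],
                   1: [-856.989, 334.987],
                   2: [-857.042, 335.006],
                   3: [-857.042, 335.006]})

# duplicated columns become duplicated rows after a transpose;
# note the double transpose casts dtypes to object (.infer_objects() restores them)
df = df.T.drop_duplicates().T

# renumber the remaining measurement columns 0, 1, 2, ...
measure = [c for c in df.columns if c not in ('ID', 'Nom. value')]
df = df.rename(columns={c: i for i, c in enumerate(measure)})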

Related

Changing a column name and its values at the same time

Pandas help!
I have a specific column like this:
   Mpg
0   18
1   17
2   19
3   21
4   16
5   15
Mpg is miles per gallon.
Now I need to rename that 'Mpg' column to 'litre per 100 km' and convert its values to litres per 100 km at the same time. Any help? Thanks beforehand.
-Tom
I managed to change the name of the column, but I could not do both simultaneously.
Use pop to return and delete the column at the same time, and rdiv to perform the conversion (x mpg ≈ 235.15/x litres per 100 km):
df['litre per 100 km'] = df.pop('Mpg').rdiv(235.15)
If you want to insert the column in the same position (get_loc is evaluated before pop, so the original position is captured first):
df.insert(df.columns.get_loc('Mpg'), 'litre per 100 km',
          df.pop('Mpg').rdiv(235.15))
Output:
   litre per 100 km
0         13.063889
1         13.832353
2         12.376316
3         11.197619
4         14.696875
5         15.676667
An alternative to pop would be to store the result in another dataframe; this way you can perform the two steps at the same time. In the code below, I first reproduce your dataframe, then store the conversion constant and apply it to all entries using the apply method.
import pandas as pd

df = pd.DataFrame({'Mpg': [18, 17, 19, 21, 16, 15]})
cc = 235.214583  # constant for conversion from mpg to L/100km
df2 = pd.DataFrame()
df2['litre per 100 km'] = df['Mpg'].apply(lambda x: cc/x)
print(df2)
The output of this code is:
   litre per 100 km
0         13.067477
1         13.836152
2         12.379715
3         11.200694
4         14.700911
5         15.680972
as expected.

Search values in a Pandas DataFrame with values from another DataFrame

I have 2 dataframes.
df_dora
        content         feature              id
1       cyber hygien    risk management      1
2       cyber risk      risk management      2
...     ...             ...                  ...
59      intellig share  information sharing  63
60      inform share    information sharing  64
df_corpus
        content                                          id                                meta.name          meta._split_id
0       market grow cyber attack...                      56a2a2e28954537131a4aa734f49e361  14_Group_AG_2021   0
1       sec form file index                              7aedfd4df02687d3dff9897c925da508  14_Group_AG_2021   1
...     ...                                              ...                               ...                ...
213769  cyber secur alert parent compani fina...         ab10325601597f203f3f0af7aa647112  17_La_Banque_2021  8581
213770  intellig share statement parent compani fina...  6af5687ac31849d19d2048e0b2ca472d  17_La_Banque_2021  8582
I am trying to extract a count of each term listed in df_dora.content within df_corpus.content, grouped by df_corpus.meta.name.
I tried to use isin:
df = df_corpus[df_corpus.content.isin(df_dora.content)]
len(df)
This returns only 17 rows:
        content      id                                meta.name                  meta._split_id
41474   incid        a4c478e0fad1b9775c05e01d871b3aaf  3_Agricole_2021            10185
68690   oper risk    2e5139d82c242c89523110cc1110647a  10_Banking_Group_PLC_2021  5525
...     ...          ...                               ...                        ...
99259   risk report  a84eefb9a4772d13eb67f2d6ae5215cb  31_Building_Society_2021   4820
105662  risk manag   e8050be841fedb6dd10599e8b4892a9f  43_Bank_SA_2021            131
df_corpus.loc[df_corpus.content.isin(df_dora.content), 'content'].tolist()
also returns 17 rows
If I search for 2 of the terms that exist in df_dora directly in df_corpus:
resiliency_term = df_corpus.loc[df_corpus['content'].str.contains("cyber risk|inform share", case=False)]
print(resiliency_term)
I get 243 rows (which matches what was in the original file).
So, given the above, my question is: how do I extract a count of each term listed in df_dora.content within df_corpus.content, grouped by df_corpus.meta.name?
Thanks in advance for any help.
isin only keeps rows whose content is exactly equal to one of the terms, which is why you get just 17 rows. For substring matches, build a regex alternation from the terms and use str.findall:
unique_vals = '|'.join(df_dora.content.unique())
df_corpus.groupby('meta.name').apply(
    lambda x: x.content.str.findall(unique_vals).explode().value_counts())
Output given the four sample rows of each frame:
17_La_Banque_2021 intellig share 1
Name: content, dtype: int64
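One hedged caveat: str.findall treats the joined terms as a regex, so if any df_dora.content term contains regex metacharacters, escape the terms before joining them:

import re
unique_vals = '|'.join(map(re.escape, df_dora.content.unique()))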

Pivoting with groupby?

I wonder if you can help me to find a solution for the following problem. Given a data frame df1 like this
import pandas as pd

d1 = {'L': ['aaa', 'bbb', 'ccc', 'aaa', 'bbb', 'ddd'],
      'w': [1, 5, 9, 13, 17, 21],
      'x': [2, 6, 10, 14, 18, 22],
      'y': [3, 7, 11, 15, 19, 23],
      'z': [4, 8, 12, 16, 20, 24]}
df1 = pd.DataFrame(d1)
and two dictionaries to define grouping over columns and rows
dctRowGroups = {'aaa': 'A', 'bbb': 'B', 'ccc': 'A', 'ddd': 'B'}
dctColGroups = {'w': 'ALPHA', 'x': 'BETA', 'y': 'ALPHA', 'z': 'BETA'}
I wanted to aggregate over the columns as a first step. Applying
g2 = df1.groupby(dctColGroups, axis=1)
g2.sum()
results in the ALPHA/BETA column sums, but the 'L' column is dropped. I wanted to keep the 'L' column for the next, row-wise aggregation step, i.e. the result should be a dataframe df2 with columns L, ALPHA and BETA.
What do I need to code to make this happen?
As a next step, I want to aggregate df2 over the rows using the dctRowGroups dictionary
g3 = df2.groupby(dctRowGroups, axis=0)
g3.sum()
to get a final result aggregated to the row groups A and B.
In what way can I do all these steps in as few lines of code as possible?
Appreciate your advice on this.
Thanks a lot
Willfried.
You can do it as follows.
Firstly, create df2 and insert the 'L' column using the insert() method:
df2 = df1.groupby(dctColGroups, axis=1).sum()
df2.insert(0, 'L', df1['L'])  # use insert only when the column order matters
# OR (use either insert or assign, not both)
df2 = df2.assign(L=df1['L'])  # otherwise assign is fine
Finally, use the assign(), map() and groupby() methods:
result = df2.assign(L=df2['L'].map(dctRowGroups)).groupby('L').sum()
Outputs:
df2:
     L  ALPHA  BETA
0  aaa      4     6
1  bbb     12    14
2  ccc     20    22
3  aaa     28    30
4  bbb     36    38
5  ddd     44    46
result:
   ALPHA  BETA
L
A     52    58
B     92    98
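On recent pandas versions, where groupby(..., axis=1) is deprecated, a hedged alternative sketch is to transpose for the column-wise step and chain both aggregations:

import pandas as pd

d1 = {'L': ['aaa', 'bbb', 'ccc', 'aaa', 'bbb', 'ddd'],
      'w': [1, 5, 9, 13, 17, 21], 'x': [2, 6, 10, 14, 18, 22],
      'y': [3, 7, 11, 15, 19, 23], 'z': [4, 8, 12, 16, 20, 24]}
dctRowGroups = {'aaa': 'A', 'bbb': 'B', 'ccc': 'A', 'ddd': 'B'}
dctColGroups = {'w': 'ALPHA', 'x': 'BETA', 'y': 'ALPHA', 'z': 'BETA'}

df1 = pd.DataFrame(d1)
result = (df1.set_index('L')
             .T.groupby(dctColGroups).sum().T  # column-wise step: w,x,y,z -> ALPHA,BETA
             .groupby(dctRowGroups).sum())     # row-wise step: aaa..ddd -> A,B
print(result)  # same totals as above: A 52/58, B 92/98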

pandas create Cross-Validation based on specific columns

I have a dataframe of a few hundred rows that can be grouped by id as follows:
df =
   Val1  Val2  Val3  Id
   2     2     8     b
   1     2     3     a
   5     7     8     z
   5     1     4     a
   0     9     0     c
   3     1     3     b
   2     7     5     z
   7     2     8     c
   6     5     5     d
   ...
   5     1     8     a
   4     9     0     z
   1     8     2     z
I want to use GridSearchCV, but with a custom CV that will ensure that all the rows with the same Id always end up in the same set.
So either all the rows of a are in the test set, or all of them are in the train set, and likewise for every other Id.
I want to have 5 folds, so 80% of the ids will go to the train set and 20% to the test set.
I understand that it can't guarantee that all folds will have exactly the same number of rows, since one Id might have more rows than another.
What is the best way to do so?
As stated, you can provide cv with an iterator. You can use GroupShuffleSplit(): once you use it to split your dataset, you can pass the result to GridSearchCV() as the cv parameter.
As mentioned in the sklearn documentation, there's a parameter called "cv" where you can provide "An iterable yielding (train, test) splits as arrays of indices."
Do check out the documentation first in the future.
As mentioned previously, GroupShuffleSplit() splits data based on group labels. However, the test sets aren't necessarily disjoint (i.e. over multiple splits, an Id may appear in multiple test sets). If you want each Id to appear in exactly one test fold, you could use GroupKFold(). This is also available in sklearn.model_selection, and directly extends KFold to take group labels into account.
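A minimal sketch of the GroupKFold route (the toy target and estimator here are assumptions, not from the question):

import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, GroupKFold

df = pd.DataFrame({'Val1': [2, 1, 5, 5, 0, 3, 2, 7, 6],
                   'Val2': [2, 2, 7, 1, 9, 1, 7, 2, 5],
                   'Val3': [8, 3, 8, 4, 0, 3, 5, 8, 5],
                   'Id':   list('bazacbzcd')})
y = [0, 1, 0, 1, 0, 1, 0, 1, 0]  # hypothetical target

X = df[['Val1', 'Val2', 'Val3']]
groups = df['Id']  # every row with the same Id stays in the same fold

cv = GroupKFold(n_splits=3)  # 5 in the question; 3 here since the toy data has only 5 Ids
search = GridSearchCV(RandomForestClassifier(random_state=0),
                      param_grid={'n_estimators': [10, 50]},
                      # an iterable yielding (train, test) index arrays, as the docs describe
                      cv=cv.split(X, y, groups=groups))
search.fit(X, y)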

SAS INPUT COLUMN

I have a problem in SAS: I would like to know how I can read several columns into only one column (put everything in a single variable)?
For example, I have 3 columns but I would like to put these 3 columns into only one column,
like this:
1 2 3
1 3 1
3 4 4
output:
1
1
3
2
3
4
3
1
4
I'm assuming you're reading from a file, so use the trailing @@ to keep reading values past the end of the line (note this reads the values in row order):
data want;
  input a @@;
  cards;
1 2 3
1 3 1
3 4 4
;
run;
If the dataset is not big, just split it into several small data sets with one variable each, then rename all the variables to one name and concatenate them vertically using a simple set statement (a sketch follows below). I am sure there are more elegant solutions than this one; if your data set is big, let me know and I will write the actual code needed to perform this action with optimal coding.
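A quick sketch of that idea for the 3-column example (the data set and variable names are placeholders, not from the question):

data have;
  input x1 x2 x3;
  cards;
1 2 3
1 3 1
3 4 4
;
run;

/* one small data set per variable, all renamed to the same name */
data c1; set have; a = x1; keep a; run;
data c2; set have; a = x2; keep a; run;
data c3; set have; a = x3; keep a; run;

/* stack them vertically; this yields the column-major order 1 1 3 2 3 4 3 1 4 */
data want;
  set c1 c2 c3;
run;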