(Python) Dataframe copy with selected columns and index

The whole dataframe can be copied to df2 as below. How can I copy only column 'B' (together with the index) from df to df2?
import pandas as pd
df = pd.DataFrame({'A': [10, 20, 30],'B': [100, 200, 300]}, index=['2021-11-24', '2021-11-25', '2021-11-26'])
df2 = df.copy()

You can simply select and then copy as follows:
df2 = df[['B']].copy()
I am selecting with a list (['B'] rather than 'B') so that the result is a DataFrame instead of a pd.Series.
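To illustrate the difference (an editorial example, reusing the df defined in the question):
import pandas as pd

df = pd.DataFrame({'A': [10, 20, 30], 'B': [100, 200, 300]},
                  index=['2021-11-24', '2021-11-25', '2021-11-26'])

print(type(df['B']))    # <class 'pandas.core.series.Series'>
print(type(df[['B']]))  # <class 'pandas.core.frame.DataFrame'>

df2 = df[['B']].copy()
print(df2)
#               B
# 2021-11-24  100
# 2021-11-25  200
# 2021-11-26  300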

Related

Parallel sampling and groupby in pandas

I have a large df (>=100k rows and 40 columns) that I want to repeatedly sample and group by. The code below works, but I was wondering whether any part of the process can be parallelised to speed it up.
The df can live in shared memory, and nothing in it gets changed; I just need to return one or more aggregates for each column.
import pandas as pd
import numpy as np
from tqdm import tqdm
data = pd.DataFrame(np.random.randint(0,100,size=(100, 4)), columns=list('ABCD'))
data['variant'] = np.repeat(['A', 'B'],50)
samples_list = []
for i in tqdm(range(0, 1000)):
    df = data.sample(
        frac=1,          # take the same number of samples as there are rows
        replace=True,    # allow the same row to be drawn multiple times
        random_state=i   # set state to i for reproducibility
    ).groupby(['variant']).agg(
        {
            'A': 'count',
            'B': [np.nanmean, np.sum, np.median, 'count'],
            'C': [np.nanmean, np.sum],
            'D': [np.sum]
        }
    )
    df['experiment'] = i
    samples_list.append(df)

# Convert to a df
samples = pd.concat(samples_list)
samples.head()
If you have enough memory, you can replicate the data first, then groupby and agg:
n_exp = 10
samples = (data.sample(n=len(data) * n_exp, replace=True, random_state=43)
               .assign(experiment=np.repeat(np.arange(n_exp), len(data)))
               .groupby(['variant', 'experiment'])
               .agg({'A': 'count',
                     'B': [np.nanmean, np.sum, np.median, 'count'],
                     'C': [np.nanmean, np.sum],
                     'D': [np.sum]})
          )
This is 4x faster with your sample data on my system.
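The question also asks about parallelising the loop itself. A minimal sketch of that (an editorial addition, not from the original answer; it assumes joblib is installed and reuses the data frame defined in the question):
import numpy as np
import pandas as pd
from joblib import Parallel, delayed

def one_bootstrap(i, data):
    # one resample-and-aggregate step, equivalent to a single iteration of the loop above
    out = (data.sample(frac=1, replace=True, random_state=i)
               .groupby(['variant'])
               .agg({'A': 'count',
                     'B': [np.nanmean, np.sum, np.median, 'count'],
                     'C': [np.nanmean, np.sum],
                     'D': [np.sum]}))
    out['experiment'] = i
    return out

# run the bootstrap iterations on all available cores, then combine the results
results = Parallel(n_jobs=-1)(delayed(one_bootstrap)(i, data) for i in range(1000))
samples = pd.concat(results)
Whether this beats the replicate-then-groupby approach depends on the size of the frame and the cost of shipping it to the worker processes.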

how to consolidate series data and make new dataframe in pandas?

I've got a dataframe like this:
original data (image in the original post)
and I hope to have a new dataframe like below:
new data (image in the original post)
How can I write code for this modification? It needs to consolidate the first series of data and create a new dataframe.
Some imports:
import pandas as pd
import numpy as np
Here we create a dataframe from the data you provided:
df = pd.DataFrame({
    "a": [
        'A2C02158300', 'D REC/BAS16-03W 100V 250mA SOD323 0s SMD',
        'D201,D206,D218,D219,D222,D302,D308,D408', 'D409,D501,D502,D505,D506,D507,D508',
        'A2C02250500', 'T BIP/PUMD3,SOT363,SMD SOLDERING', 'T209,T501,T502'
    ]
})
df.head(10)
Output:
Then we prepare a dataframe with the first two columns (the result goes into a new variable, out, so that the original df stays available for the next step):
s1 = df.iloc[::4, :]
s1.reset_index(drop=True, inplace=True)
s2 = df.iloc[1::4, :]
s2.reset_index(drop=True, inplace=True)
out = pd.DataFrame({
    'a': s1['a'],
    'b': s2['a']
})
After that we prepare and add the third column:
s3 = df.iloc[2::4, :]
s3.reset_index(drop=True, inplace=True)
s3 = s3['a'].str.split(',').apply(pd.Series).stack()
s3.index = s3.index.droplevel(-1)
s3.name = 'c'
out = out.join(s3)
out.reset_index(drop=True, inplace=True)
out
Output:
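A quick sanity check on the result (an editorial addition, assuming the sample data above): out ends up with one row per reference designator.
print(out.shape)                   # (11, 3)
print(out['c'].head(3).tolist())   # ['D201', 'D206', 'D218']
Note that the [::4] slicing only picks up one reference line per part, so the continuation row starting with 'D409' from the original data does not appear in column c.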

How to apply a list comprehension to a Pandas DataFrame?

From a list of values, I am trying to identify every sequential pair of values whose sum is at least 10.
a = [1, 9, 3, 4, 5]
...so I wrote a for loop...
values = []
for i in range(len(a) - 1):
    if sum(a[i:i+2]) >= 10:
        values += [a[i:i+2]]
...which I rewrote as a list comprehension...
values = [a[i:i+2] for i in range(len(a) - 1) if sum(a[i:i+2]) >= 10]
Both produce the same output:
values = [[1, 9], [9, 3]]
My question is: how best can I apply the above list comprehension to a DataFrame?
Here is a sample 5-row DataFrame:
import pandas as pd
df = pd.DataFrame({'A': [1, 1, 1, 1, 0],
                   'B': [9, 8, 3, 2, 2],
                   'C': [3, 3, 3, 10, 3],
                   'E': [4, 4, 4, 4, 4],
                   'F': [5, 5, 5, 5, 5]})
df['X'] = df.values.tolist()
where:
- a corresponds to df['X'], which holds the values of columns A - F as a list:
df['X'] = [[1,9,3,4,5], [1,8,3,4,5], [1,3,3,4,5], [1,2,10,4,5], [0,2,3,4,5]]
and the result of the list comprehension is to be stored in a new column df['X1'].
Desired output is:
df['X1'] = [[[1,9], [9,3]],[[8,3]],[[NaN]],[[2,10],[10,4]],[[NaN]]]
Thank you.
You could use pandas' apply function and put your list comprehension inside it.
df = pd.DataFrame({'A': [1, 1, 1, 1, 0],
                   'B': [9, 8, 3, 2, 2],
                   'C': [3, 3, 3, 10, 3],
                   'E': [4, 4, 4, 4, 4],
                   'F': [5, 5, 5, 5, 5]})
df['x'] = df.apply(lambda a: [list(a[i:i+2]) for i in range(len(a) - 1) if sum(a[i:i+2]) >= 10], axis=1)
# Note: the axis parameter tells pandas whether to apply the function by rows or by columns;
# axis=1 applies the function to each row.
This gives essentially the output stated for df['X1'], except that rows with no qualifying pair come back as an empty list rather than [[NaN]].
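For the sample frame above, the expected result is (an editorial check, given the >= 10 threshold used here):
print(df['x'].tolist())
# [[[1, 9], [9, 3]], [[8, 3]], [], [[2, 10], [10, 4]], []]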

Performing operations after styling in a dataframe

I have two dataframes, df1 and df2. I am trying to apply styling to df1, then drop a column from it, and finally concatenate it with df2. The styling on df1 should be retained, but it is being lost.
I am using the code listed below, but it doesn't seem to work:
import pandas as pd

df1 = pd.DataFrame([["A", 1], ["B", 2]], columns=["Letter", "Number"])
df2 = pd.DataFrame([["A", 1], ["B", 2]], columns=["Letter2", "Number2"])

def highlight(s):
    return ['background-color: red'] * 2

df1 = df1.style.apply(highlight)
df1.data = df1.data.drop('Letter', axis=1)
combined = pd.concat([df1.data, df2], sort=True)
with pd.ExcelWriter('testcolor.xlsx') as writer:
    combined.to_excel(writer, sheet_name='test')
I am expecting "Number" (from df1) to be highlighted red, and Letter2 and Number2 to keep their original colour.
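A minimal sketch of one possible approach (an editorial addition, not part of the original thread): styles live on the Styler object rather than on the DataFrame, so pd.concat on the underlying frames discards them. One option is to concatenate the plain frames first and only then apply the highlight, restricted to the column that came from df1, exporting via the Styler:
import pandas as pd

df1 = pd.DataFrame([["A", 1], ["B", 2]], columns=["Letter", "Number"])
df2 = pd.DataFrame([["A", 1], ["B", 2]], columns=["Letter2", "Number2"])

# concatenate the plain frames first (same concat call as in the question)
combined = pd.concat([df1.drop('Letter', axis=1), df2], sort=True)

# then style only the column that originated in df1 and export via the Styler;
# note that the whole "Number" column is highlighted, including the NaN cells
# contributed by df2's rows
styled = combined.style.apply(lambda s: ['background-color: red'] * len(s),
                              subset=['Number'])

with pd.ExcelWriter('testcolor.xlsx') as writer:
    styled.to_excel(writer, sheet_name='test')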

loop through data frame

I want to reverse the column order of my data frames using a for loop, but it doesn't work. My code is as follows:
import pandas as pd
df1 = pd.DataFrame({'a': 1, 'b': 2}, index=[1])
df2 = pd.DataFrame({'c': 3, 'c': 4}, index=[1])  # note: duplicate key 'c', so only 'c': 4 survives
for df in [df1, df2]:
    df = df.loc[:, df.columns.tolist()[::-1]]
Then the order of columns of df1 and df2 is not changed.
You can make use of multiple assignment with a list comprehension, i.e.
df1,df2 = [i.loc[:,i.columns[::-1]] for i in [df1,df2]]
print(df1)
b a
1 2 1
print(df2)
c
1 4
Note: in my answer I am trying to build up to showing that using a dictionary to store the dataframes is the best way for the general case. If you are looking to mutate the original dataframe variables, @Bharath's answer is the way to go.
Answer:
The code doesn't work because the loop only rebinds the variable df; nothing is assigned back to the list of dataframes. Here's how to fix that:
import pandas as pd
df1 = pd.DataFrame({'a': 1, 'b': 2}, index=[1])
df2 = pd.DataFrame({'c': 3, 'c': 4}, index=[1])
l = [df1, df2]
for i, df in enumerate(l):
    l[i] = df.loc[:, df.columns.tolist()[::-1]]
So the difference is that I iterate with enumerate to get both the dataframe and its position in the list, and then assign the changed dataframe back to that position.
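A minimal illustration of the underlying issue (an editorial sketch, not from the original answer): rebinding the loop variable never touches the objects the list holds.
items = [1, 2]
for x in items:
    x = x * 10   # rebinds the name x only; the list is unchanged
print(items)     # [1, 2]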
execution details:
Before applying the change:
In [28]: for i in l:
...: print(i.head())
...:
a b
1 1 2
c
1 4
In [29]: for i, df in enumerate(l):
...: l[i] = df.loc[:, df.columns.tolist()[::-1]]
...:
After applying the change:
In [30]: for i in l:
...: print(i.head())
...:
b a
1 2 1
c
1 4
Improvement proposal:
It's better to use a dictionary as follows:
import pandas as pd
d = {}
d['df1'] = pd.DataFrame({'a': 1, 'b': 2}, index=[1])
d['df2'] = pd.DataFrame({'c': 3, 'c': 4}, index=[1])
for i, df in d.items():
    d[i] = df.loc[:, df.columns.tolist()[::-1]]
Then you will be able to reference your dataframes through the dictionary, for instance d['df1'].
You can reverse columns and values:
import pandas as pd
df1 = pd.DataFrame({'a':1, 'b': 2}, index=[1])
df2 = pd.DataFrame({'c':3, 'c': 4}, index=[1])
print('before')
print(df1)
for df in [df1, df2]:
    df.values[:, :] = df.values[:, ::-1]  # reverse the values in place
    df.columns = df.columns[::-1]         # reverse the column labels to match
print('after')
print(df1)
df1
Output:
before
a b
1 1 2
after
b a
1 2 1