I want to use row 3 as the column index (the header).
The quick and easy answer is:
df.T.set_index(3).T
I think you need to select the row with loc, assign it to the columns, and then drop that row from df:
df = pd.DataFrame({'A':['Groups'], 'B':['Quantity'], 'C':['Net Sales']}, index=[3])
df.columns = df.loc[3]
df = df.drop(3)
print (df)
Empty DataFrame
Columns: [Groups, Quantity, Net Sales]
Index: []
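The frame above contains only the header row, so the result is empty. A minimal sketch with two made-up data rows (the values below are hypothetical, not from the question) shows the same two steps leaving the data in place:

```python
import pandas as pd

# Hypothetical frame: row label 3 holds the real column names,
# rows 4 and 5 hold illustrative data.
df = pd.DataFrame(
    {'A': ['Groups', 4, 5], 'B': ['Quantity', 6, 7], 'C': ['Net Sales', 4, 9]},
    index=[3, 4, 5],
)

# Promote row 3 to the header, then drop it from the body.
df.columns = df.loc[3]
df = df.drop(3)

# The columns Index inherits the name 3 from the row label; clear it.
df = df.rename_axis(None, axis=1)

# The columns are still object dtype because they once held strings.
df = df.astype(int)
print(df)
```

The final astype call is only needed because the header strings forced every column to object dtype.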
But it is better to avoid the problem altogether, e.g. by using the skiprows parameter if the DataFrame comes from read_csv. The main advantage is that read_csv then infers the right dtypes for all columns:
import pandas as pd
from io import StringIO
temp=u"""A,B,C
D,E,F
G,H,I
J,K,L
Groups,Quantity,Net Sales
4,6,4"""
# after testing, replace 'StringIO(temp)' with 'filename.csv'
df = pd.read_csv(StringIO(temp))
print (df)
        A         B          C
0       D         E          F
1       G         H          I
2       J         K          L
3  Groups  Quantity  Net Sales
4       4         6          4
df = pd.read_csv(StringIO(temp), skiprows=4)
print (df)
   Groups  Quantity  Net Sales
0       4         6          4
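As a possible alternative to skiprows, read_csv also accepts a header parameter that names the line to use for column labels. The sketch below assumes the header line is the comma-separated form `Groups,Quantity,Net Sales`, which is what the printed output implies:

```python
import pandas as pd
from io import StringIO

temp = """A,B,C
D,E,F
G,H,I
J,K,L
Groups,Quantity,Net Sales
4,6,4"""

# header=4 uses the line at position 4 (0-based) as the column names
# and ignores the lines above it.
df = pd.read_csv(StringIO(temp), header=4)
print(df)
print(df.dtypes)
```

Either way the numeric columns are parsed as integers, with no later conversion step.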
Timings:
In [319]: %timeit (df.T.set_index(3).T.reset_index(drop=True).astype(float).rename_axis(None, 1))
10 loops, best of 3: 43.1 ms per loop
In [320]: %timeit (jez(df))
10 loops, best of 3: 23.7 ms per loop
In [321]: %timeit (jez1(df))
100 loops, best of 3: 13.6 ms per loop
Code for timings:
Converting to float was also added to all solutions; if all the data are strings, it is not necessary.
import numpy as np
import pandas as pd
np.random.seed(100)
df = pd.DataFrame(np.random.random((100000,3)), columns=list('ABC'))
df = df.drop([0,1,2])
df.loc[3] = ['Groups', 'Quantity', 'Net Sales']
print (df)
print (df.T.set_index(3).T.reset_index(drop=True).astype(float).rename_axis(None, axis=1))
def jez(df):
    df.columns = df.loc[3]
    return df.drop(3).reset_index(drop=True).astype(float).rename_axis(None, axis=1)
def jez1(df):
    arr = df.values
    # get the position (row number) of index label 3
    idx = df.index.get_loc(3)
    return pd.DataFrame(np.delete(arr, idx, axis=0).astype(float), columns=arr[idx])
I'm trying to drop rows with missing values in any of several dataframes.
They all have the same number of rows, so I tried this:
model_data_with_NA = pd.concat([other_df,
                                standardized_numerical_data,
                                encode_categorical_data], axis=1)
ok_rows = ~(model_data_with_NA.isna().all(axis=1))
model_data = model_data_with_NA.dropna()
assert(sum(ok_rows) == len(model_data))
False!
As a newbie in Python, I wonder why this doesn't work? Also, is it better to use hierarchical indexing? Then I can extract the original columns from model_data.
In Short
I believe the all in ~(model_data_with_NA.isna().all(axis=1)) should be replaced with any.
The reason is that all checks whether every value in a row is missing, whereas any checks whether at least one of the values is missing.
Full Example
import pandas as pd
import numpy as np
df1 = pd.DataFrame({'a':[1, 2, 3]})
df2 = pd.DataFrame({'b':[1, np.nan]})
df3 = pd.DataFrame({'c': [1, 2, np.nan]})
model_data_with_na = pd.concat([df1, df2, df3], axis=1)
ok_rows = ~(model_data_with_na.isna().any(axis=1))
model_data = model_data_with_na.dropna()
assert(sum(ok_rows) == len(model_data))
model_data_with_na
   a    b    c
0  1    1    1
1  2  nan    2
2  3  nan  nan
model_data
   a  b  c
0  1  1  1
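The same distinction exists in dropna through its how parameter, which defaults to 'any'. A small sketch on a made-up two-column frame (not the asker's data) shows which mask lines up with which call:

```python
import numpy as np
import pandas as pd

# Illustrative frame: one complete row, one partly missing, one fully missing.
df = pd.DataFrame({'a': [1, 2, np.nan],
                   'b': [1, np.nan, np.nan]})

# any(axis=1): True if at least one value in the row is missing.
has_any_na = df.isna().any(axis=1)
# all(axis=1): True only if every value in the row is missing.
has_all_na = df.isna().all(axis=1)

# dropna() defaults to how='any', so it matches ~has_any_na.
assert len(df.dropna()) == (~has_any_na).sum()
# dropna(how='all') matches ~has_all_na.
assert len(df.dropna(how='all')) == (~has_all_na).sum()
```

This is why the original assertion failed: the mask was built with all, while dropna() used the 'any' rule.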
Below is part of a range in a dataframe that I work with. I have tried to sort it using df.sort_values(['column']) but it doesn't work. I would appreciate advice.
You can simplify the solution by sorting on the first value of each range, converted to integers in the key parameter:
f = lambda x: x.str.split('-').str[0].str.replace(',','', regex=True).astype(int)
df = df.sort_values('column', key= f, ignore_index=True)
print (df)
column
0 1,000-1,999
1 2,000-2,949
2 3,000-3,999
3 4,000-4,999
4 5,000-7,499
5 10,000-14,999
6 15,000-19,999
7 20,000-24,999
8 25,000-29,999
9 30,000-39,999
10 40,000-49,999
11 103,000-124,999
12 125,000-149,999
13 150,000-199,999
14 200,000-249,999
15 250,000-299,999
16 300,000-499,999
Another idea is to sort by the first integer in each string, after removing the thousands separators:
f = lambda x: x.str.replace(',', '', regex=False).str.extract(r'(\d+)', expand=False).astype(int)
df = df.sort_values('column', key= f, ignore_index=True)
If you need to sort by both values split by -, it is possible with:
f = lambda x: x.str.replace(',', '', regex=False).str.extract(r'(\d+)-(\d+)', expand=True).astype(int).apply(tuple, axis=1)
df = df.sort_values('column', key= f, ignore_index=True)
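For reference, a self-contained sketch of the key-based sort on a few of the ranges from the output above. The helper name lower_bound is made up for this example; the key parameter of sort_values needs pandas 1.1 or later:

```python
import pandas as pd

df = pd.DataFrame({'column': ['10,000-14,999', '1,000-1,999',
                              '103,000-124,999', '2,000-2,949']})

def lower_bound(s: pd.Series) -> pd.Series:
    # '10,000-14,999' -> '10,000' -> '10000' -> 10000
    return s.str.split('-').str[0].str.replace(',', '', regex=False).astype(int)

# A plain sort compares the strings character by character,
# which is why it does not give numeric order.
print(df.sort_values('column')['column'].tolist())

# key= receives the whole column as a Series and must return
# a Series of the same length to sort by.
df_sorted = df.sort_values('column', key=lower_bound, ignore_index=True)
print(df_sorted['column'].tolist())
```

The original strings are kept in the result; only the sort order comes from the integers.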
I want to add the values of one column to several other columns:
import pandas as pd
df= pd.DataFrame(data={"a":[1,2],"b":[102,4], "c":[4,5]})
# what I intended to do
df[["a","b"]] = df[["a","b"]] + df[["c"]]
Expected result:
df["a"] = df["a"] + df["c"]
df["b"] = df["b"] + df["c"]
You can assume a list of columns is available (["a", "b"]). Is there a way of doing this without a loop and without writing one line per column? There must be...
Use DataFrame.add with axis=0, and select column c with single brackets [] so that it is a Series:
df[["a","b"]] = df[["a","b"]].add(df["c"], axis=0)
print (df)
   a    b  c
0  5  106  4
1  7    9  5
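A short sketch of why the attempt in the question fails while add with axis=0 works, using the same frame as the question:

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2], "b": [102, 4], "c": [4, 5]})

# DataFrame + DataFrame aligns on column labels: 'a', 'b' and 'c' never
# match each other, so every cell of the result is NaN.
misaligned = df[["a", "b"]] + df[["c"]]
print(misaligned)

# DataFrame.add(Series, axis=0) aligns the Series on the row index instead
# and adds it to each selected column.
cols = ["a", "b"]
df[cols] = df[cols].add(df["c"], axis=0)
print(df)
```

The list cols can hold any number of column names, so no loop is needed.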
I have a dataframe and I want to group the rows based on a specific column. The number of rows in each group will be at least 4 and at most 50. I want to save one column from each group as two lines. If the group size is even, say 2n, then n rows go on the first line and the remaining n on the second. If it is odd, either n+1 and n or n and n+1 will do.
For example,
import pandas as pd
from io import StringIO
data = """
id,name
1,A
1,B
1,C
1,D
2,E
2,F
2,ds
2,G
2, dsds
"""
df = pd.read_csv(StringIO(data))
I want to groupby id
df.groupby('id',sort=False)
and then get a dataframe like
id name
0 1 A B
1 1 C D
2 2 E F ds
3 2 G dsds
Probably not the most efficient solution, but it works:
df = df.sort_values('id')
# next 3 lines: for each group, find the point at which to split it
df['range_idx'] = range(0, df.shape[0])
df['mean_rank_group'] = df.groupby(['id'])['range_idx'].transform('mean')
df['separate_column'] = df['range_idx'] < df['mean_rank_group']
# group by id and the helper column; sort=False keeps the first half of each group first
df.groupby(['id', 'separate_column'], as_index=False, sort=False)['name'].agg(','.join).drop(
    columns='separate_column')
This is a somewhat convoluted approach, but it does the job:
def func(s: pd.Series):
    # split the group in half; the first part gets at least one element
    mid = max(s.shape[0] // 2, 1)
    l1 = ' '.join(s.iloc[:mid])
    l2 = ' '.join(s.iloc[mid:])
    return [l1, l2]
df_new = df.groupby('id').agg(func)
df_new["name1"] = df_new["name"].apply(lambda x: x[0])
df_new["name2"] = df_new["name"].apply(lambda x: x[1])
df = (df_new.drop(labels="name", axis=1)
            .stack()
            .reset_index()
            .drop(labels=["level_1"], axis=1)
            .rename(columns={0: "name"})
            .set_index("id"))
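Another way to get the two lines per group, sketched here as an alternative to both answers above rather than a correction of them, is to number the rows inside each group with cumcount and compare that position with half the group size. The helper column name half is made up for this example, and the sample data is the question's with the stray space before dsds removed:

```python
import pandas as pd
from io import StringIO

data = """id,name
1,A
1,B
1,C
1,D
2,E
2,F
2,ds
2,G
2,dsds
"""
df = pd.read_csv(StringIO(data))

g = df.groupby('id', sort=False)
pos = g.cumcount()                   # 0-based position inside each group
size = g['name'].transform('size')   # group size repeated on every row

# Rows in the first half (rounded up) get 0, the remaining rows get 1.
half = (pos >= (size + 1) // 2).astype(int).rename('half')

out = (df.groupby(['id', half], sort=False)['name']
         .agg(' '.join)
         .reset_index(level='half', drop=True)
         .reset_index())
print(out)
```

With an odd group size the first line gets the extra element, which the question allows.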
I have a data set like this: {'IT',[1,20,35,44,51,....,1000]}
I want to convert this into a pandas DataFrame in the format below. How can I achieve this output?
Dept Count
IT 1
IT 20
IT 35
IT 44
IT 51
.. .
.. .
.. .
IT 1000
I can write it the way shown below, but this is not efficient for huge data.
data = [['IT',1],['IT',2],['IT',3]]
df = pd.DataFrame(data,columns=['Dept','Count'])
print(df)
No need for a list comprehension since pandas will automatically fill IT in for every row.
import pandas as pd
d = {'IT':[1,20,35,44,51,1000]}
df = pd.DataFrame({'dept': 'IT', 'count': d['IT']})
Use a list comprehension to build tuples and pass them to the DataFrame constructor:
d = {'IT':[1,20,35,44,51], 'NEW':[1000]}
data = [(k, x) for k, v in d.items() for x in v]
df = pd.DataFrame(data,columns=['Dept','Count'])
print(df)
Dept Count
0 IT 1
1 IT 20
2 IT 35
3 IT 44
4 IT 51
5 NEW 1000
You can use melt:
import pandas as pd
d = {'IT': [10]*100000}
df = pd.DataFrame(d)
df = pd.melt(df, var_name='Dept', value_name='Count')
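Note that building a DataFrame directly from the dict, as the melt version does, requires every list to have the same length. When the lists differ in length, one possible sketch is to keep each list in a single cell and let explode expand it (the ignore_index argument of explode assumes pandas 1.1 or later):

```python
import pandas as pd

d = {'IT': [1, 20, 35, 44, 51], 'NEW': [1000]}

# One row per key, with the whole list stored in a single cell ...
df = pd.DataFrame({'Dept': list(d.keys()), 'Count': list(d.values())})

# ... then explode turns each list element into its own row.
df = df.explode('Count', ignore_index=True)

# explode leaves the column as object dtype, so convert it back.
df['Count'] = df['Count'].astype(int)
print(df)
```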