conversion column names into first row - pandas

I would like to convert the following dataframe into a json .
df:
A sector B sector C sector
TTM Ratio                 --   35.99   12.70  20.63  14.75      23.06
RRM Sales            --  114.57    1.51   5.02   1.00    4594.13
MQR book         1.48    2.64    1.02   2.46   2.73       2.74
TTR cash        --   14.33    7.41  15.35   8.59  513854.86
In order to do so by using the function df.to_json() I would need to have unique names in column and indices.
Therefore what I am looking for is to convert the column names into a row and have default column numbers . In short I would like the following output:
df:
0 1 2 3 4 5
A sector B sector C sector
TTM Ratio                 --   35.99   12.70  20.63  14.75      23.06
RRM Sales            --  114.57    1.51   5.02   1.00    4594.13
MQR book         1.48    2.64    1.02   2.46   2.73       2.74
TTR cash        --   14.33    7.41  15.35   8.59  513854.86
Turning the column names into the first row so I can make the conversion correctly .

You could also use vstack in numpy:
>>> df
x y z
0 8 7 6
1 6 5 4
>>> pd.DataFrame(np.vstack([df.columns, df]))
0 1 2
0 x y z
1 8 7 6
2 6 5 4
The columns become the actual first row in this case.

Use assign by list of range and original column names:
print (range(len(df.columns)))
range(0, 6)
#for python2 list can be omit
df.columns = [list(range(len(df.columns))), df.columns]
Or MultiIndex.from_arrays:
df.columns = pd.MultiIndex.from_arrays([range(len(df.columns)), df.columns])
Also is possible use RangeIndex:
print (pd.RangeIndex(len(df.columns)))
RangeIndex(start=0, stop=6, step=1)
df.columns = pd.MultiIndex.from_arrays([pd.RangeIndex(len(df.columns)), df.columns])
print (df)
0 1 2 3 4 5
A sector B sector C sector
TTM Ratio -- 35.99 12.70 20.63 14.75 23.06
RRM Sales -- 114.57 1.51 5.02 1.00 4594.13
MQR book 1.48 2.64 1.02 2.46 2.73 2.74
TTR cash -- 14.33 7.41 15.35 8.59 513854.86

Related

Panda key value pair data frame

Does panda can convert the key value to customized table. Here is the sample of the data.
1675484100 customer=A.1 area=1 height=20 width={10,10} length=1
1675484101 customer=B.1 area=10 height=30 width={20,11} length=2
1675484102 customer=C.1 area=11 height=40 width={30,12} length=3 remarks=call
Generate a table with key as a header and the associated value. First field as a time.
I would use a regex to get each key/value pair, then reshape:
data = '''1675484100 customer=A.1 area=1 height=20 width={10,10} length=1
1675484101 customer=B.1 area=10 height=30 width={20,11} length=2
1675484102 customer=C.1 area=11 height=40 width={30,12} length=3 remarks=call'''
df = (pd.Series(data.splitlines()).radd('time=')
.str.extractall(r'([^\s=]+)=([^\s=]+)')
.droplevel('match').set_index(0, append=True)[1]
# unstack keeping order
.pipe(lambda d: d.unstack()[d.index.get_level_values(-1).unique()])
)
print(df)
Output:
0 time customer area height width length remarks
0 1675484100 A.1 1 20 {10,10} 1 NaN
1 1675484101 B.1 10 30 {20,11} 2 NaN
2 1675484102 C.1 11 40 {30,12} 3 call
Assuming that your input is a string defined as data, you can use this :
L = [{k: v for k, v in (x.split("=") for x in l.split()[1:])}
for l in data.split("\n") if l.strip()]
​
df = pd.DataFrame(L)
​
df.insert(0, "time", [pd.to_datetime(int(x.split()[0]), unit="s")
for x in data.split("\n")])
Otherwise, if the data are stored in some sort of a (.txt) file, add this at the beginning :
with open("file.txt", "r") as f:
data = f.read()
Output :
print(df)
​
time customer area height width length remarks
0 2023-02-04 04:15:00 A.1 1 20 {10,10} 1 NaN
1 2023-02-04 04:15:01 B.1 10 30 {20,11} 2 NaN
2 2023-02-04 04:15:02 C.1 11 40 {30,12} 3 call

Pandas - Move data in one column to the same row in a different column

I have a df which looks like the below, There are 2 quantity columns and I want to move the quantities in the "QTY 2" column to the "QTY" column
Note: there are no instances where there are values in the same row for both columns (So for each row, QTY is either populated or else QTY 2 is populated. Not Both)
DF
Index
Product
QTY
QTY 2
0
Shoes
5
1
Jumpers
10
2
T Shirts
15
3
Shorts
13
Desired Output
Index
Product
QTY
0
Shoes
5
1
Jumpers
10
2
T Shirts
15
3
Shorts
13
Thanks
Try this:
import numpy as np
df['QTY'] = np.where(df['QTY'].isnull(), df['QTY 2'], df['QTY'])
df["QTY"] = df["QTY"].fillna(df["QTY 2"], downcast="infer")
filling the gaps of QTY with QTY 2:
In [254]: df
Out[254]:
Index Product QTY QTY 2
0 0 Shoes 5.0 NaN
1 1 Jumpers NaN 10.0
2 2 T Shirts NaN 15.0
3 3 Shorts 13.0 NaN
In [255]: df["QTY"] = df["QTY"].fillna(df["QTY 2"], downcast="infer")
In [256]: df
Out[256]:
Index Product QTY QTY 2
0 0 Shoes 5 NaN
1 1 Jumpers 10 10.0
2 2 T Shirts 15 15.0
3 3 Shorts 13 NaN
downcast="infer" makes it "these look like integer after NaNs gone, so make the type integer".
you can drop QTY 2 after this with df = df.drop(columns="QTY 2"). If you want one-line is as usual possible:
df = (df.assign(QTY=df["QTY"].fillna(df["QTY 2"], downcast="infer"))
.drop(columns="QTY 2"))
You can do ( I am assuming your empty values are empty strings):
df = df.assign(QTY= df[['QTY', 'QTY2']].
replace('', 0).
sum(axis=1)).drop('QTY2', axis=1)
print(df):
Product QTY
0 Shoes 5
1 Jumpers 10
2 T Shirts 15
3 Shorts 13
If the empty values are actually NaNs then
df['QTY'] = df['QTY'].fillna(df['QTY2']) #or
df['QTY'] = df[['QTY', 'QTY2']].sum(1)

Pandas column merging on condition

This is my pandas df:
Id Protein A_Egg B_Meat C_Milk Category
A 10 10 20 0 egg
B 20 10 0 10 milk
C 20 10 10 10 meat
D 25 20 10 0 egg
I wish to merge protein column with other column based on "Category"
My output is
Id Protein_final
A 20
B 30
C 30
D 45
Ideally, I would like to show how I am approaching but, I am frankly clueless!!
EDIT: Also, How to handle is the category is blank or does meet one of the column (in that can final should be same as initial value in protein column)
Use DataFrame.lookup with some preprocessing with remove values in columns names before _ and lowercase, last add to column:
arr = df.rename(columns=lambda x: x.split('_')[-1].lower()).lookup(df.index, df['Category'])
df['Protein'] += arr
print (df)
Id Protein A_Egg B_Meat C_Milk Category
0 A 20 10 20 0 egg
1 B 30 10 0 10 milk
2 C 30 10 10 10 meat
3 D 45 20 10 0 egg
If need only 2 columns finally:
df = df[['Id','Protein']]
You can melt the dataframe, and filter for rows where category equals the variable column, and sum the final columns :
(
df
.melt(["Id", "Protein", "Category"])
.assign(variable=lambda x: x.variable.str[2:].str.lower(),
Protein_final=lambda x: x.Protein + x.value)
.query("Category == variable")
.filter(["Id", "Protein_final"])
)
Id Protein_final
0 A 20
3 D 45
6 C 30
9 B 30

Compare two data frames for different values in a column

I have two dataframe, please tell me how I can compare them by operator name, if it matches, then add the values ​​of quantity and time to the first data frame.
In [2]: df1 In [3]: df2
Out[2]: Out[3]:
Name count time Name count time
0 Bob 123 4:12:10 0 Rick 9 0:13:00
1 Alice 99 1:01:12 1 Jone 7 0:24:21
2 Sergei 78 0:18:01 2 Bob 10 0:15:13
85 rows x 3 columns 105 rows x 3 columns
I want to get:
In [5]: df1
Out[5]:
Name count time
0 Bob 133 4:27:23
1 Alice 99 1:01:12
2 Sergei 78 0:18:01
85 rows x 3 columns
Use set_index and add them together. Finally, update back.
df1 = df1.set_index('Name')
df1.update(df1 + df2.set_index('Name'))
df1 = df1.reset_index()
Out[759]:
Name count time
0 Bob 133.0 04:27:23
1 Alice 99.0 01:01:12
2 Sergei 78.0 00:18:01
Note: I assume time columns in both df1 and df2 are already in correct date/time format. If they are in string format, you need to convert them before running above commands as follows:
df1.time = pd.to_timedelta(df1.time)
df2.time = pd.to_timedelta(df2.time)

Groupby on two columns with bins(ranges) on one of them in Pandas Dataframe

I am trying to make segregate my data into buckets based on certain user attributes and I would like to see some counts in each of the buckets.For this I have imported this data into a Pandas Dataframe.
I have data that has user city, kids age and their unique id. I would like to know the count of users who reside in city A and have kids in age group 0-5.
Sample Data frame looks something like this:
city kids_age user_id
A 10 1
B 4 2
A 4 3
C 8 4
A 3 5
Expected Output:
city bin count
A 0-5 2
5-10 1
B 0-5 1
5-10 0
C 0-5 0
5-10 1
I tried group by on two columns city and kids age:
user_details_df_cropped_1.groupby(['city', 'kids_age']).count()
It gave me an output that looks something like this:
city kids_age user_id count
A 10 1 1
4 3 1
3 5 1
B 4 2 1
C 8 4 1
I returns me the users grouped by city, but not really by kids age bins(ranges). What am I missing here? Appreciate the help!!
Use cut for binning, pass to DataFrame.groupby, add 0 rows with DataFrame.stack
DataFrame.unstack an last convert to DataFrame by Series.reset_index:
bins = [0,5,10]
labels = ['{}-{}'.format(i, j) for i, j in zip(bins[:-1], bins[1:])]
b = pd.cut(df['kids_age'], bins=bins, labels=labels, include_lowest=True)
df = df.groupby(['city', b]).size().unstack(fill_value=0).stack().reset_index(name='count')
print (df)
city kids_age count
0 A 0-5 2
1 A 5-10 1
2 B 0-5 1
3 B 5-10 0
4 C 0-5 0
5 C 5-10 1
Another solution with DataFrame.reindex and MultiIndex.from_product for added mising rows filled by 0:
bins = [0,5,10]
labels = ['{}-{}'.format(i, j) for i, j in zip(bins[:-1], bins[1:])]
b = pd.cut(df['kids_age'], bins=bins, labels=labels, include_lowest=True)
mux = pd.MultiIndex.from_product([df['city'].unique(), labels], names=['city','kids_age'])
df = (df.groupby(['city', b])
.size()
.reindex(mux, fill_value=0)
.reset_index(name='count'))
print (df)
city kids_age count
0 A 0-5 2
1 A 5-10 1
2 B 0-5 1
3 B 5-10 0
4 C 0-5 0
5 C 5-10 1