Pandas key-value pairs to data frame - pandas

Can pandas convert key=value pairs into a custom table? Here is a sample of the data:
1675484100 customer=A.1 area=1 height=20 width={10,10} length=1
1675484101 customer=B.1 area=10 height=30 width={20,11} length=2
1675484102 customer=C.1 area=11 height=40 width={30,12} length=3 remarks=call
I want to generate a table with each key as a column header and the associated values below it. The first field of each line is a time.

I would use a regex to get each key/value pair, then reshape:
import pandas as pd
data = '''1675484100 customer=A.1 area=1 height=20 width={10,10} length=1
1675484101 customer=B.1 area=10 height=30 width={20,11} length=2
1675484102 customer=C.1 area=11 height=40 width={30,12} length=3 remarks=call'''
# prefix each line with 'time=' so the leading timestamp is parsed like any other pair
df = (pd.Series(data.splitlines()).radd('time=')
        .str.extractall(r'([^\s=]+)=([^\s=]+)')
        .droplevel('match').set_index(0, append=True)[1]
        # unstack keeping order
        .pipe(lambda d: d.unstack()[d.index.get_level_values(-1).unique()])
     )
print(df)
Output:
0        time  customer  area  height    width  length  remarks
0  1675484100       A.1     1      20  {10,10}       1      NaN
1  1675484101       B.1    10      30  {20,11}       2      NaN
2  1675484102       C.1    11      40  {30,12}       3     call
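To make the "extract, then reshape" idea easier to follow, here is a small sketch that stops after the regex step so you can look at the intermediate long table before it is unstacked. The two shortened input lines and the variable names (lines, pairs, wide) are made up for illustration; only the regex and the method chain come from the answer above.

```python
import pandas as pd

# Shortened, made-up sample lines in the same format as the question.
lines = pd.Series([
    "1675484100 customer=A.1 area=1",
    "1675484101 customer=B.1 area=10 remarks=call",
])

# Prefix each line with "time=" so the timestamp becomes a key=value pair too.
prefixed = "time=" + lines

# Long format: one row per key=value match, column 0 = key, column 1 = value.
pairs = prefixed.str.extractall(r"([^\s=]+)=([^\s=]+)")
print(pairs)

# Reshape to wide format: one row per input line, one column per key.
# Note: a plain unstack() sorts the columns alphabetically.
wide = pairs.droplevel("match").set_index(0, append=True)[1].unstack()
print(wide)
```

Keys that are missing on a line (remarks on the first line here) simply come out as NaN after the unstack.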

Assuming that your input is a string named data, you can use this:
import pandas as pd
# skip blank lines once, so the records and the time column stay the same length
lines = [line for line in data.split("\n") if line.strip()]
L = [{k: v for k, v in (x.split("=") for x in line.split()[1:])}
     for line in lines]
df = pd.DataFrame(L)
df.insert(0, "time", [pd.to_datetime(int(line.split()[0]), unit="s")
                      for line in lines])
Otherwise, if the data is stored in a file (e.g. a .txt file), add this at the beginning:
with open("file.txt", "r") as f:
    data = f.read()
Output:
print(df)
                  time  customer  area  height    width  length  remarks
0  2023-02-04 04:15:00       A.1     1      20  {10,10}       1      NaN
1  2023-02-04 04:15:01       B.1    10      30  {20,11}       2      NaN
2  2023-02-04 04:15:02       C.1    11      40  {30,12}       3     call
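One thing to watch with the split("=") approach: if a value ever contains an "=" itself, the unpacking into k, v fails. A possible variant, sketched below, splits on the first "=" only. The helper name parse_line and the remarks=a=b value are invented for the example and are not part of the question's data.

```python
import pandas as pd

def parse_line(line):
    """Turn '<epoch> key=value key=value ...' into a dict (hypothetical helper)."""
    timestamp, *fields = line.split()
    record = {"time": pd.to_datetime(int(timestamp), unit="s")}
    for field in fields:
        # partition splits on the first "=" only, so the value may contain "=".
        key, _, value = field.partition("=")
        record[key] = value
    return record

data = """1675484100 customer=A.1 area=1 width={10,10}
1675484102 customer=C.1 area=11 remarks=a=b
"""

# Blank lines are skipped so every record corresponds to a real input line.
df = pd.DataFrame([parse_line(line) for line in data.splitlines() if line.strip()])
print(df)
```

Building the time value inside the same record also avoids the separate insert step, at the cost of a slightly longer helper.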

Related

pandas finding duplicate rows with different label

I want to sanity check labeled data. I have hundreds of features and want to find points which have the same features but a different label. Each cluster of disagreeing labels found this way should then be numbered and put into a new dataframe.
This isn't hard, but I am wondering what the most elegant solution is.
Here is an example:
import pandas as pd
df = pd.DataFrame({
"feature_1" : [0,0,0,4,4,2],
"feature_2" : [0,5,5,1,1,3],
"label" : ["A","A","B","B","D","A"]
})
result_df = pd.DataFrame({
"cluster_index" : [0,0,1,1],
"feature_1" : [0,0,4,4],
"feature_2" : [5,5,1,1],
"label" : ["A","B","B","D"]
})
In order to get the output you want (both de-duplication and cluster_index), you can use a groupby approach:
g = df.groupby(['feature_1', 'feature_2'])['label']
(df.assign(cluster_index=g.ngroup()) # get group name
.loc[g.transform('size').gt(1)] # filter the non-duplicates
# line below only to have a nice cluster_index range (0,1…)
.assign(cluster_index= lambda d: d['cluster_index'].factorize()[0])
)
output:
   feature_1  feature_2  label  cluster_index
1          0          5      A              0
2          0          5      B              0
3          4          1      B              1
4          4          1      D              1
First keep all rows that are duplicated on the feature columns, then, if necessary, remove rows that are duplicated across all columns (not needed for the sample data), and finally add GroupBy.ngroup for the group indices:
df = df[df.duplicated(['feature_1','feature_2'],keep=False)].drop_duplicates()
df['cluster_index'] = df.groupby(['feature_1', 'feature_2'])['label'].ngroup()
print(df)
   feature_1  feature_2  label  cluster_index
1          0          5      A              0
2          0          5      B              0
3          4          1      B              1
4          4          1      D              1
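A side note on both approaches: a size or duplicated check flags any repeated feature combination, including rows that repeat with the same label. If you only want groups whose labels actually disagree, one option is to filter on the number of distinct labels per group. This is a sketch under that assumption; the extra (2, 3, "A") row is added here only to show the difference.

```python
import pandas as pd

# Sample data from the question plus one extra row: (2, 3) now appears twice
# with the SAME label, so it is a duplicate but not a disagreement.
df = pd.DataFrame({
    "feature_1": [0, 0, 0, 4, 4, 2, 2],
    "feature_2": [0, 5, 5, 1, 1, 3, 3],
    "label": ["A", "A", "B", "B", "D", "A", "A"],
})
features = ["feature_1", "feature_2"]

# Number of distinct labels within each feature group, broadcast to every row.
n_labels = df.groupby(features)["label"].transform("nunique")

# Keep only rows whose feature group carries more than one distinct label.
conflicts = df[n_labels > 1].copy()

# Number the remaining groups 0, 1, ... (in sorted order of the feature values).
conflicts["cluster_index"] = conflicts.groupby(features).ngroup()
print(conflicts)
```

Because ngroup is computed after filtering, the cluster numbers are already consecutive and no factorize step is needed.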

Pandas Groupby and Apply

I am performing a groupby and apply over a dataframe that is returning some strange results. I am using pandas 1.3.1.
Here is the code:
import pandas as pd
ddf = pd.DataFrame({
    "id": [1,1,1,1,2]
})
def do_something(df):
    return "x"
ddf["title"] = ddf.groupby("id").apply(do_something)
ddf
I would expect every row in the title column to be assigned the value "x" but when this happens I get this data:
   id  title
0   1    NaN
1   1      x
2   1      x
3   1    NaN
4   2    NaN
Is this expected?
The result is not strange; it is the expected behavior. apply returns one value per group, and the group keys (here 1 and 2) become the index of the result:
>>> list(ddf.groupby("id"))
[(1,       # the group name (the future index of the grouped df)
     id    # the subset dataframe of group 1
  0   1
  1   1
  2   1
  3   1),
 (2,       # the group name (the future index of the grouped df)
     id    # the subset dataframe of group 2
  4   2)]
Why do you get any result at all? Because the group labels (1 and 2) happen to also be labels in the index of your dataframe, and the assignment aligns on the index:
>>> ddf.groupby("id").apply(do_something)
id
1 x
2 x
dtype: object
Now change the id like this:
ddf['id'] += 10
# id
# 0 11
# 1 11
# 2 11
# 3 11
# 4 12
ddf["title"] = ddf.groupby("id").apply(do_something)
# id title
# 0 11 NaN
# 1 11 NaN
# 2 11 NaN
# 3 11 NaN
# 4 12 NaN
Or change the index:
ddf.index += 10
# id
# 10 1
# 11 1
# 12 1
# 13 1
# 14 2
ddf["title"] = ddf.groupby("id").apply(do_something)
# id title
# 10 1 NaN
# 11 1 NaN
# 12 1 NaN
# 13 1 NaN
# 14 2 NaN
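The alignment effect can be reproduced without groupby at all. The sketch below builds by hand the same kind of Series that apply returns (indexed by the group keys 1 and 2) and assigns it to the frame; the names per_group and title_mapped are made up for the example. It also shows map on the id column as one way to broadcast the per-group value to every row.

```python
import pandas as pd

ddf = pd.DataFrame({"id": [1, 1, 1, 1, 2]})

# Same shape of result as ddf.groupby("id").apply(do_something):
# one value per group, indexed by the group keys.
per_group = pd.Series(["x", "x"], index=pd.Index([1, 2], name="id"))

# Column assignment aligns on index LABELS: only rows labelled 1 and 2 match.
ddf["title"] = per_group

# Looking the value up through the id column reaches every row instead.
ddf["title_mapped"] = ddf["id"].map(per_group)
print(ddf)
```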
Yes, it is expected.
First of all, the apply(do_something) part works like a charm; it is the groupby right before it that produces the unexpected result.
A groupby returns a GroupBy object, which is a little different from a normal dataframe. If you debug and inspect what the groupby returns, you can see that you need some form of summary function (such as mean, max or sum) to use it. For example, if you run one of them like this:
df = ddf.groupby("id")
df.mean()
it leads to this result:
Empty DataFrame
Columns: []
Index: [1, 2]
do_something is applied once per group, so its result is indexed by the group keys 1 and 2; when that result is assigned back, it is aligned on the index of your original df. This is why only the rows with index 1 and 2 get x.
For now I would recommend leaving out the groupby, since it is not clear why you want to use it here anyway,
and having a deeper look into the GroupBy object.
If you need a new column computed by a group-wise function, use GroupBy.transform; you have to select the column to process after the groupby, here id:
ddf["title"] = ddf.groupby("id")['id'].transform(do_something)
Or assign new column in function:
def do_something(x):
    x['title'] = 'x'
    return x
ddf = ddf.groupby("id").apply(do_something)
The other answers explain why the original code does not work.
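To see why transform fits here, it may help to try it with a real aggregation. In this sketch the value column is an assumption added for illustration (the question's frame only has id); transform computes the mean per group and returns a result with the same index as the original frame, so the assignment lines up row by row.

```python
import pandas as pd

# "value" is a made-up column so there is something to aggregate.
ddf = pd.DataFrame({"id": [1, 1, 1, 1, 2], "value": [10, 20, 30, 40, 50]})

# transform returns one result per ROW (not per group), indexed like ddf.
ddf["group_mean"] = ddf.groupby("id")["value"].transform("mean")
print(ddf)
```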

Mapping dictionary of lists to a pandas df

I have a dictionary which contains an id and a list of corresponding values for that id.
I am attempting to map this dictionary to a pandas df.
The df contains the same id to map to, but it needs to map the items in that list in order of appearance within the df.
For example:
sample_dict = {0:[0.1,0.4,0.5], 1:[0.2,0.14,0.3], 2:[0.2,0.1,0.4]}
The df looks like:
The output of mapping the dictionary to the df would look like:
Sorry for typing the table out like this, the actual df is very large, and I'm still new to stack exchange and pandas.
The end output should just map the values in each id's list, in order, to the players as they appear, since the df is sorted by id and then player.
Let us try explode with reindex
df['new'] = pd.Series(sample_dict).reindex(df.id.unique()).explode().values
df
Out[140]:
id Player new
0 0 1 0.1
1 0 2 0.4
2 0 3 0.5
3 1 1 0.2
4 1 2 0.14
5 1 3 0.3
6 2 1 0.2
7 2 2 0.1
8 2 3 0.4
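The explode approach relies on the df being sorted by id and on every list having exactly as many items as the id has rows. If you are not sure about the ordering, a possible alternative is to number the rows within each id with cumcount and use that as the position in the list. This is only a sketch; the df below is rebuilt from the output shown above, and pos is a made-up name.

```python
import pandas as pd

sample_dict = {0: [0.1, 0.4, 0.5], 1: [0.2, 0.14, 0.3], 2: [0.2, 0.1, 0.4]}

# Frame reconstructed from the id / Player columns in the output above.
df = pd.DataFrame({
    "id": [0, 0, 0, 1, 1, 1, 2, 2, 2],
    "Player": [1, 2, 3, 1, 2, 3, 1, 2, 3],
})

# Position of each row within its id group: 0, 1, 2, 0, 1, 2, ...
pos = df.groupby("id").cumcount()

# Pick the pos-th item from the list that belongs to the row's id.
df["new"] = [sample_dict[i][p] for i, p in zip(df["id"], pos)]
print(df)
```

This raises an IndexError if an id has more rows than list items, which is arguably a useful check on large data.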

Why this inconsistency between a Dataframe and a column of it?

While debugging a nasty error in my code I came across something that looks like an inconsistency in the way DataFrames work (using pandas 1.0.3):
import pandas as pd
df = pd.DataFrame([[10*k, 11, 22, 33] for k in range(4)], columns=['d', 'k', 'c1', 'c2'])
y = df.k
X = df[['c1', 'c2']]
Then I tried to add a column to y (forgetting that y is a Series, not a Dataframe):
y['d'] = df['d']
I'm now aware that this adds a weird row to the Series; y is now:
0 11
1 11
2 11
3 11
d 0 0
1 10
2 20
3 30
Name: d, dtype...
Name: k, dtype: object
But the weird thing is that now:
>>> df.shape, df['k'].shape
((4, 4), (5,))
And df and df['k'] look like:
d k c1 c2
0 0 11 22 33
1 10 11 22 33
2 20 11 22 33
3 30 11 22 33
and
0 11
1 11
2 11
3 11
d 0 0
1 10
2 20
3 30
Name: d, dtype...
Name: k, dtype: object
There are a few things at work here:
A pandas series can store objects of arbitrary types.
y['d'] = _ adds a new object to the series y under the label 'd'.
Thus, y['d'] = df['d'] adds a new entry to the series y with label 'd' whose value is the series df['d'].
So you have added a series as the last entry of the series y. You can verify that
(y['d'] == y.iloc[-1]).all() == True and
(y.iloc[-1] == df['d']).all() == True.
To clarify the inconsistency between df and df.k: note that df.k, df['k'], or df.loc[:, 'k'] returns the series 'view' of column k, so adding an entry to the series appends it directly to this view. However, df.k shows the entire series, whereas df only shows the series up to the maximum length df.shape[0]. Hence the inconsistent behavior.
I agree that this behavior is prone to bugs and should be fixed. View vs. copy is a common cause for many issues. In this case, df.iloc[:, 1] behaves correctly and should be used instead.
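If the original intent was simply to get k and d side by side, one way to sidestep the problem is to make y a DataFrame and an explicit copy from the start. A minimal sketch using the frame from the question:

```python
import pandas as pd

df = pd.DataFrame([[10 * k, 11, 22, 33] for k in range(4)],
                  columns=["d", "k", "c1", "c2"])

# Double brackets select a one-column DataFrame rather than a Series;
# .copy() makes y independent of df.
y = df[["k"]].copy()

# On a DataFrame this adds a column, aligned on the shared index.
y["d"] = df["d"]

print(y.shape, df.shape, df["k"].shape)
```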

conversion column names into first row

I would like to convert the following dataframe into JSON.
df:
              A  sector      B  sector      C     sector
TTM Ratio    --   35.99  12.70   20.63  14.75      23.06
RRM Sales    --  114.57   1.51    5.02   1.00    4594.13
MQR book   1.48    2.64   1.02    2.46   2.73       2.74
TTR cash     --   14.33   7.41   15.35   8.59  513854.86
To do so with df.to_json() I need unique names in the columns and the index.
So what I am looking for is to turn the column names into a row and use default column numbers. In short, I would like the following output:
df:
              0       1      2       3      4          5
              A  sector      B  sector      C     sector
TTM Ratio    --   35.99  12.70   20.63  14.75      23.06
RRM Sales    --  114.57   1.51    5.02   1.00    4594.13
MQR book   1.48    2.64   1.02    2.46   2.73       2.74
TTR cash     --   14.33   7.41   15.35   8.59  513854.86
That is, turn the column names into the first row so that the conversion works correctly.
You could also use vstack in numpy:
>>> df
x y z
0 8 7 6
1 6 5 4
>>> pd.DataFrame(np.vstack([df.columns, df]))
0 1 2
0 x y z
1 8 7 6
2 6 5 4
The columns become the actual first row in this case.
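Tying this back to the JSON goal, here is a small sketch of the full round trip on a frame with a duplicated column name. The tiny frame and the names flat, json_text and rows are made up for the example. Note that vstack only stacks the column labels on top of the values, so the row index of the original frame is not carried over.

```python
import json

import numpy as np
import pandas as pd

# Small frame with a duplicated column name ("x" appears twice).
df = pd.DataFrame([[8, 7, 6], [6, 5, 4]], columns=["x", "y", "x"])

# Column labels become the first data row; new columns are numbered 0, 1, 2.
flat = pd.DataFrame(np.vstack([df.columns, df]))

# With unique column labels the frame can be serialized.
json_text = flat.to_json(orient="values")

# Parse it back to check what was written.
rows = json.loads(json_text)
print(rows)
```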
Assign a list containing a range and the original column names:
print(range(len(df.columns)))
range(0, 6)
# in Python 2, list() can be omitted
df.columns = [list(range(len(df.columns))), df.columns]
Or MultiIndex.from_arrays:
df.columns = pd.MultiIndex.from_arrays([range(len(df.columns)), df.columns])
It is also possible to use RangeIndex:
print (pd.RangeIndex(len(df.columns)))
RangeIndex(start=0, stop=6, step=1)
df.columns = pd.MultiIndex.from_arrays([pd.RangeIndex(len(df.columns)), df.columns])
print (df)
              0       1      2       3      4          5
              A  sector      B  sector      C     sector
TTM Ratio    --   35.99  12.70   20.63  14.75      23.06
RRM Sales    --  114.57   1.51    5.02   1.00    4594.13
MQR book   1.48    2.64   1.02    2.46   2.73       2.74
TTR cash     --   14.33   7.41   15.35   8.59  513854.86
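A quick way to convince yourself that the numbered top level really removes the ambiguity is to check uniqueness before and after. The one-row frame below is a cut-down, made-up stand-in for the data in the question.

```python
import pandas as pd

# Cut-down stand-in with the same kind of repeated "sector" column name.
df = pd.DataFrame([[1.48, 2.64, 1.02, 2.46]],
                  columns=["A", "sector", "B", "sector"],
                  index=["MQR book"])
was_unique = df.columns.is_unique  # False: "sector" appears twice

# Add a numbered level on top of the original names.
df.columns = pd.MultiIndex.from_arrays([list(range(len(df.columns))), df.columns])

print(was_unique, df.columns.is_unique)
print(df)
```

Each column is now identified by a (number, name) pair, so the repeated name no longer clashes, and the original names stay available in the second level.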