I have a column in a dataframe with values like the ones below:
{'id': 22188, 'value': 'trunk'}
{'id': 22170, 'value': 'motor'}
I want to replace the single quotes with double quotes so the values can be used as JSON. I am trying:
df['column'] = df['column'].replace({'\'': '"'}, regex=True)
But nothing changes.
How can I do this?
Expected result:
{"id": 22188, "value": "trunk"}
{"id": 22170, "value": "motor"}
There is no need to escape the characters specially; just use the opposite quote family. There is no need for regex either. You do, however, have to use the .str accessor to do string replacements:
df['column'] = df['column'].str.replace("'",'"', regex=False)
This works fine with string fields:
>>> pd.Series(["{'id': 22188, 'value': 'trunk'}","{'id': 22170, 'value': 'motor'}"])
0 {'id': 22188, 'value': 'trunk'}
1 {'id': 22170, 'value': 'motor'}
dtype: object
>>> pd.Series(["{'id': 22188, 'value': 'trunk'}","{'id': 22170, 'value': 'motor'}"]).str.replace("'",'"')
0 {"id": 22188, "value": "trunk"}
1 {"id": 22170, "value": "motor"}
dtype: object
>>>
It fails with dictionaries, so convert them first with astype(str):
>>> pd.Series([{'id': 22188, 'value': 'trunk'},{'id': 22170, 'value': 'motor'}])
0 {'id': 22188, 'value': 'trunk'}
1 {'id': 22170, 'value': 'motor'}
dtype: object
>>> pd.Series([{'id': 22188, 'value': 'trunk'},{'id': 22170, 'value': 'motor'}]).str.replace("'",'"')
0 NaN
1 NaN
dtype: float64
>>> pd.Series([{'id': 22188, 'value': 'trunk'},{'id': 22170, 'value': 'motor'}]).astype(str).str.replace("'",'"')
0 {"id": 22188, "value": "trunk"}
1 {"id": 22170, "value": "motor"}
dtype: object
Related
Basically, I want to use the iterrows method to loop through my grouped dataframe, but I can't figure out how the columns work. In the example below, it does not create columns called "Group1" and "Group2" like one might expect. Is one of the columns a dtype itself?
import pandas as pd
df = pd.DataFrame([
    {"Group1": "Apple", "Group2": "Red Delicious", "Amount": 15},
    {"Group1": "Apple", "Group2": "McIntosh", "Amount": 20},
    {"Group1": "Apple", "Group2": "McIntosh", "Amount": 30},
    {"Group1": "Apple", "Group2": "Fuju", "Amount": 7},
    {"Group1": "Orange", "Group2": "Navel", "Amount": 9},
    {"Group1": "Orange", "Group2": "Navel", "Amount": 5},
    {"Group1": "Orange", "Group2": "Mandarin", "Amount": 12},
])
print(df.dtypes)
print(df.to_string())
df_sum = df.groupby(['Group1', 'Group2'])[['Amount']].sum()
print("---- Sum Results----")
print(df_sum.dtypes)
print(df_sum.to_string())
for index, row in df_sum.iterrows():
# The line below is what I want to do conceptually.
# print(row.Group1, row.Group2, row.Amount)  # AttributeError: 'Series' object has no attribute 'Group1'
print(row.Amount)  # this works, but the group keys are not accessible
The part of the output we are interested in is below. I noticed that "Group1" and "Group2" are on a line below "Amount".
---- Sum Results----
Amount int64
dtype: object
Amount
Group1 Group2
Apple Fuju 7
McIntosh 50
Red Delicious 15
Orange Mandarin 12
Navel 14
Simply try:
df_sum = df.groupby(['Group1', 'Group2'])['Amount'].sum().reset_index()
OR
df_sum = df.groupby(['Group1', 'Group2'])['Amount'].agg('sum').reset_index()
It can even be as simple as the following, since we are summing based on Group1 & Group2 only:
df_sum = df.groupby(['Group1', 'Group2']).sum().reset_index()
Another way:
df_sum = df.groupby(['Group1', 'Group2']).agg({'Amount': 'sum'}).reset_index()
Try resetting the index:
df_sum = df.groupby(['Group1', 'Group2'])[['Amount']].sum().reset_index()
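After reset_index, Group1 and Group2 are ordinary columns again rather than index levels, so the loop from the question works as intended:
for index, row in df_sum.iterrows():
    print(row.Group1, row.Group2, row.Amount)
which prints:
Apple Fuju 7
Apple McIntosh 50
Apple Red Delicious 15
Orange Mandarin 12
Orange Navel 14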
Is there a smart, pythonic way to parse a nested column in a pandas dataframe like this one into 3 different columns? For example, the column could look like this:
col1
[{'name': 'amount', 'value': 1}, {'name': 'frequency', 'value': 2}, {'name': 'freq_unit', 'value': 'month'}]
[{'name': 'amount', 'value': 3}, {'name': 'frequency', 'value': 1}, {'name': 'freq_unit', 'value': 'month'}]
And the expected result should be these 3 columns:
amount frequency freq_unit
1 2 month
3 1 month
That's just level 1. I also have a level 2: what if the elements in the list still have the same names (amount, frequency and freq_unit) but the order can change? Could the code in the answer deal with this?
col1
[{'name': 'amount', 'value': 1}, {'name': 'frequency', 'value': 2}, {'name': 'freq_unit', 'value': 'month'}]
[{'name': 'amount', 'value': 3}, {'name': 'freq_unit', 'value': 'month'}, {'name': 'frequency', 'value': 1}]
Code to reproduce the data is below. I really look forward to seeing how the community solves this. Thank you!
data = {'col1':[[{'name': 'amount', 'value': 1}, {'name': 'frequency', 'value': 2}, {'name': 'freq_unit', 'value': 'month'}],
[{'name': 'amount', 'value': 3}, {'name': 'frequency', 'value': 1}, {'name': 'freq_unit', 'value': 'month'}]]}
df = pd.DataFrame(data)
A combination of a list comprehension, itertools.chain, and collections.defaultdict could help out here:
from itertools import chain
from collections import defaultdict

result = defaultdict(list)
# Pull out (name, value) pairs from every record in every row.
pairs = [[(rec["name"], rec["value"]) for rec in entry]
         for entry in df.col1]
# Flatten the nested lists and collect the values per name;
# keying by name means the order within each row does not matter.
for key, value in chain.from_iterable(pairs):
    result[key].append(value)
pd.DataFrame(result)
amount frequency freq_unit
0 1 2 month
1 3 1 month
The above is verbose: #piRSquared's comment is much simpler, with a list comprehension:
pd.DataFrame([{x["name"]: x["value"] for x in lst} for lst in df.col1])
Another idea, though quite unnecessary, is a list comprehension combined with Pandas' string methods:
# Note: this indexes records by position, so it assumes every row lists
# the records in the same order (it would break on the level 2 data).
outcome = [df.col1.str[num].str["value"].rename(df.col1.str[num].str["name"][0])
           for num in range(df.col1.str.len()[0])]
pd.concat(outcome, axis='columns')
#piRSquared's solution is the simplest, in my opinion.
You can write a function that will parse each cell in your Series and return a properly formatted Series and use apply to tuck the iteration away:
>>> def custom_parser(record):
... clean_record = {rec["name"]: rec["value"] for rec in record}
... return pd.Series(clean_record)
>>> df["col1"].apply(custom_parser)
amount frequency freq_unit
0 1 2 month
1 3 1 month
I have the following dataframe:
df = pd.DataFrame([{'name': 'a', 'label': 'false', 'score': 10},
{'name': 'a', 'label': 'true', 'score': 8},
{'name': 'c', 'label': 'false', 'score': 10},
{'name': 'c', 'label': 'true', 'score': 4},
{'name': 'd', 'label': 'false', 'score': 10},
{'name': 'd', 'label': 'true', 'score': 6},
])
I want to return the names whose "false" label score is at least double the score of their "true" label. In my example, it should return only the name "c".
First, you can pivot the data, then look at the ratio and filter what you want:
new_df = df.pivot(index='name',columns='label', values='score')
new_df[new_df['false'].div(new_df['true']).ge(2)]
output:
label false true
name
c 10 4
If you only want the names, you can do:
new_df.index[new_df['false'].div(new_df['true']).ge(2)].values
which gives
array(['c'], dtype=object)
Update: since your data is the result of orig_df.groupby().count(), you could instead do:
orig_df['label'].eq('true').groupby(orig_df['name']).mean()
and look at the rows with values <= 1/3 (false being at least double true means the fraction of true rows is at most 1/3).
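Concretely, the filtering step might look like this (a sketch, assuming orig_df has one row per observation with name and label columns):
# fraction of 'true' rows per name; <= 1/3 means false appears at least twice as often
frac_true = orig_df['label'].eq('true').groupby(orig_df['name']).mean()
names = frac_true.index[frac_true.le(1/3)]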
I want to convert this dataset (shown as an image in the original post: a table with name, rollno, teacher and year columns) into this JSON format using pandas:
y = {'name':['a','b','c'],"rollno":[1,2,3],"teacher":'xyz',"year":1998}
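For a reproducible example, the DataFrame behind the screenshot presumably looks like this (a hypothetical reconstruction from the expected output):
df = pd.DataFrame({'name': ['a', 'b', 'c'],
                   'rollno': [1, 2, 3],
                   'teacher': ['xyz', 'xyz', 'xyz'],
                   'year': [1998, 1998, 1998]})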
First create a dictionary with DataFrame.to_dict, and collapse the columns whose values are all identical to scalars, using a dictionary comprehension with if-else that checks the length of a set:
d = {k: v if len(set(v)) > 1 else v[0] for k, v in df.to_dict('list').items()}
print (d)
{'name': ['a', 'b', 'c'], 'rollno': [1, 2, 3], 'teacher': 'xyz', 'year': 1998}
And then convert to json:
import json
j = json.dumps(d)
print (j)
{"name": ["a", "b", "c"], "rollno": [1, 2, 3], "teacher": "xyz", "year": 1998}
If the duplicated values should be kept as lists:
import json
j = json.dumps(df.to_dict(orient='list'))
print (j)
{"name": ["a", "b", "c"], "rollno": [1, 2, 3],
"teacher": ["xyz", "xyz", "xyz"], "year": [1998, 1998, 1998]}
I have a table in a PostgreSQL 9.5 database with a JSONB field that contains a dictionary in the following form:
{'1': {'id': 1,
'length': 24,
'date_started': '2015-08-25'},
'2': {'id': 2,
'length': 27,
'date_started': '2015-09-18'},
'3': {'id': 3,
'length': 27,
'date_started': '2015-10-15'},
}
The number of elements in the dictionary (the '1', '2', etc.) may vary between rows.
I would like to get the average of length using a single SQL query. Any suggestions on how to achieve this?
Use jsonb_each:
[local] #= SELECT json, AVG((v->>'length')::int)
FROM j, jsonb_each(json) js(k, v)
GROUP BY json;
┌───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┬─────────────────────┐
│ json │ avg │
├───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┼─────────────────────┤
│ {"1": {"id": 1, "length": 240, "date_started": "2015-08-25"}, "2": {"id": 2, "length": 27, "date_started": "2015-09-18"}, "3": {"id": 3, "length": 27, "date_started": "2015-10-15"}} │ 98.0000000000000000 │
│ {"1": {"id": 1, "length": 24, "date_started": "2015-08-25"}, "2": {"id": 2, "length": 27, "date_started": "2015-09-18"}, "3": {"id": 3, "length": 27, "date_started": "2015-10-15"}} │ 26.0000000000000000 │
└───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┴─────────────────────┘
(2 rows)
Time: 0,596 ms
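For reference, a minimal setup that reproduces the second row above (assuming, as in the query, a table j with a single jsonb column named json):
CREATE TABLE j (json jsonb);
-- one row whose object holds three elements with lengths 24, 27 and 27
INSERT INTO j (json) VALUES
('{"1": {"id": 1, "length": 24, "date_started": "2015-08-25"},
  "2": {"id": 2, "length": 27, "date_started": "2015-09-18"},
  "3": {"id": 3, "length": 27, "date_started": "2015-10-15"}}');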