Pandas - Extracting value based on common key

I have a Dataframe in the below format:
id, key1, key2
101, {'key': 'key_1001', 'fields': {'type': {'subtask': False}, 'summary': 'Title_1' , 'id': '71150'}}, NaN
101, NaN,{'key': 'key_1002', 'fields': {'type': {'subtask': False}, 'summary': 'Title_2' , 'id': '71151'}}
102, {'key': 'key_2001', 'fields': {'type': {'subtask': False}, 'summary': 'Title_11' , 'id': '71160'}}, NaN
102, NaN,{'key': 'key_2002', 'fields': {'type': {'subtask': False}, 'summary': 'Title_12' , 'id': '71161'}}
I am trying to achieve the below output from the above Dataframe.
id, key_value_1, key_value_2
101, key_1001, key_1002
102, key_2001, key_2002
Output of df.to_dict() (truncated):
{'id': {103: '101', 676: '101'}, 'key1' : {103: {'fields': {'type': {'subtask': False}, 'summary': 'Title_1' , 'id': '71150'},
676: nan}

You can use:
s=df.set_index('id').stack().str.get('key').unstack()
key1 key2
id
101 key_1001 key_1002
102 key_2001 key_2002
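A runnable sketch of that approach, rebuilding a small version of the frame from the question (the 'fields' payload is abbreviated here):
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'id':   ['101', '101', '102', '102'],
    'key1': [{'key': 'key_1001', 'fields': {'summary': 'Title_1'}}, np.nan,
             {'key': 'key_2001', 'fields': {'summary': 'Title_11'}}, np.nan],
    'key2': [np.nan, {'key': 'key_1002', 'fields': {'summary': 'Title_2'}},
             np.nan, {'key': 'key_2002', 'fields': {'summary': 'Title_12'}}],
})

# stack() drops the NaN cells, .str.get('key') reads the 'key' entry of each dict,
# and unstack() turns key1/key2 back into columns
s = df.set_index('id').stack().str.get('key').unstack()
print(s.rename(columns={'key1': 'key_value_1', 'key2': 'key_value_2'}).reset_index())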

Related

Slicing PySpark DataFrame by converting to Pandas DataFrame, Error when converting back to PySpark DataFrame

I want to slice a PySpark DataFrame by selecting a specific column and several rows as below:
import pandas as pd
# Data filled in our DataFrame
rows = [['Lee Chong Wei', 69, 'Malaysia'],
['Lin Dan', 66, 'China'],
['Srikanth Kidambi', 9, 'India'],
['Kento Momota', 15, 'Japan']]
# Columns of our DataFrame
columns = ['Player', 'Titles', 'Country']
# DataFrame is created
df = spark.createDataFrame(rows, columns)
# Converting DataFrame to pandas
pandas_df = df.toPandas()
# First DataFrame formed by slicing
df1 = pandas_df.iloc[[2], :2]
# Second DataFrame formed by slicing
df2 = pandas_df.iloc[[2], 2:]
# Converting the slices to PySpark DataFrames
df1 = spark.createDataFrame(df1, schema = "Country")
df2 = spark.createDataFrame(df2, schema = "Country")
I am running this notebook on Databricks, so there is no need to create a SparkSession.
I get a ParseException when running the following lines:
df1 = spark.createDataFrame(df1, schema = "Country")
df2 = spark.createDataFrame(df2, schema = "Country")
Please let me know any ideas to solve this issue. The full error message is below:
---------------------------------------------------------------------------
ParseException Traceback (most recent call last)
<command-4065192899858765> in <module>
23
24 # Converting the slices to PySpark DataFrames
---> 25 df1 = spark.createDataFrame(df1, schema = "Country")
26 df2 = spark.createDataFrame(df2, schema = "Country")
/databricks/spark/python/pyspark/sql/session.py in createDataFrame(self, data, schema, samplingRatio, verifySchema)
706
707 if isinstance(schema, str):
--> 708 schema = _parse_datatype_string(schema)
709 elif isinstance(schema, (list, tuple)):
710 # Must re-encode any unicode strings to be consistent with StructField names
/databricks/spark/python/pyspark/sql/types.py in _parse_datatype_string(s)
841 return from_ddl_datatype("struct<%s>" % s.strip())
842 except:
--> 843 raise e
844
845
/databricks/spark/python/pyspark/sql/types.py in _parse_datatype_string(s)
831 try:
832 # DDL format, "fieldname datatype, fieldname datatype".
--> 833 return from_ddl_schema(s)
834 except Exception as e:
835 try:
/databricks/spark/python/pyspark/sql/types.py in from_ddl_schema(type_str)
823 def from_ddl_schema(type_str):
824 return _parse_datatype_json_string(
--> 825 sc._jvm.org.apache.spark.sql.types.StructType.fromDDL(type_str).json())
826
827 def from_ddl_datatype(type_str):
/databricks/spark/python/lib/py4j-0.10.9.1-src.zip/py4j/java_gateway.py in __call__(self, *args)
1302
1303 answer = self.gateway_client.send_command(command)
-> 1304 return_value = get_return_value(
1305 answer, self.gateway_client, self.target_id, self.name)
1306
/databricks/spark/python/pyspark/sql/utils.py in deco(*a, **kw)
121 # Hide where the exception came from that shows a non-Pythonic
122 # JVM exception message.
--> 123 raise converted from None
124 else:
125 raise
ParseException:
mismatched input '<EOF>' expecting {'APPLY', 'CALLED', 'CHANGES', 'CLONE', 'COLLECT', 'CONTAINS', 'CONVERT', 'COPY', 'COPY_OPTIONS', 'CREDENTIAL', 'CREDENTIALS', 'DEEP', 'DEFINER', 'DELTA', 'DETERMINISTIC', 'ENCRYPTION', 'EXPECT', 'FAIL', 'FILES', 'FORMAT_OPTIONS', 'HISTORY', 'INCREMENTAL', 'INPUT', 'INVOKER', 'LANGUAGE', 'LIVE', 'MATERIALIZED', 'MODIFIES', 'OPTIMIZE', 'PATTERN', 'READS', 'RESTORE', 'RETURN', 'RETURNS', 'SAMPLE', 'SCD TYPE 1', 'SCD TYPE 2', 'SECURITY', 'SEQUENCE', 'SHALLOW', 'SNAPSHOT', 'SPECIFIC', 'SQL', 'STORAGE', 'STREAMING', 'UPDATES', 'UP_TO_DATE', 'VIOLATION', 'ZORDER', 'ADD', 'AFTER', 'ALL', 'ALTER', 'ALWAYS', 'ANALYZE', 'AND', 'ANTI', 'ANY', 'ARCHIVE', 'ARRAY', 'AS', 'ASC', 'AT', 'AUTHORIZATION', 'BETWEEN', 'BOTH', 'BUCKET', 'BUCKETS', 'BY', 'CACHE', 'CASCADE', 'CASE', 'CAST', 'CATALOG', 'CATALOGS', 'CHANGE', 'CHECK', 'CLEAR', 'CLUSTER', 'CLUSTERED', 'CODE', 'CODEGEN', 'COLLATE', 'COLLECTION', 'COLUMN', 'COLUMNS', 'COMMENT', 'COMMIT', 'COMPACT', 'COMPACTIONS', 'COMPUTE', 'CONCATENATE', 'CONSTRAINT', 'COST', 'CREATE', 'CROSS', 'CUBE', 'CURRENT', 'CURRENT_DATE', 'CURRENT_TIME', 'CURRENT_TIMESTAMP', 'CURRENT_USER', 'DAY', 'DATA', 'DATABASE', 'DATABASES', 'DATEADD', 'DATEDIFF', 'DBPROPERTIES', 'DEFAULT', 'DEFINED', 'DELETE', 'DELIMITED', 'DESC', 'DESCRIBE', 'DFS', 'DIRECTORIES', 'DIRECTORY', 'DISTINCT', 'DISTRIBUTE', 'DIV', 'DROP', 'ELSE', 'END', 'ESCAPE', 'ESCAPED', 'EXCEPT', 'EXCHANGE', 'EXISTS', 'EXPLAIN', 'EXPORT', 'EXTENDED', 'EXTERNAL', 'EXTRACT', 'FALSE', 'FETCH', 'FIELDS', 'FILTER', 'FILEFORMAT', 'FIRST', 'FN', 'FOLLOWING', 'FOR', 'FOREIGN', 'FORMAT', 'FORMATTED', 'FROM', 'FULL', 'FUNCTION', 'FUNCTIONS', 'GENERATED', 'GLOBAL', 'GRANT', 'GRANTS', 'GROUP', 'GROUPING', 'HAVING', 'HOUR', 'IDENTITY', 'IF', 'IGNORE', 'IMPORT', 'IN', 'INCREMENT', 'INDEX', 'INDEXES', 'INNER', 'INPATH', 'INPUTFORMAT', 'INSERT', 'INTERSECT', 'INTERVAL', 'INTO', 'IS', 'ITEMS', 'JOIN', 'KEY', 'KEYS', 'LAST', 'LATERAL', 'LAZY', 'LEADING', 'LEFT', 'LIKE', 'ILIKE', 'LIMIT', 'LINES', 'LIST', 'LOAD', 'LOCAL', 'LOCATION', 'LOCK', 'LOCKS', 'LOGICAL', 'MACRO', 'MAP', 'MATCHED', 'MERGE', 'MINUTE', 'MONTH', 'MSCK', 'NAMESPACE', 'NAMESPACES', 'NATURAL', 'NO', NOT, 'NULL', 'NULLS', 'OF', 'ON', 'ONLY', 'OPTION', 'OPTIONS', 'OR', 'ORDER', 'OUT', 'OUTER', 'OUTPUTFORMAT', 'OVER', 'OVERLAPS', 'OVERLAY', 'OVERWRITE', 'PARTITION', 'PARTITIONED', 'PARTITIONS', 'PERCENTILE_CONT', 'PERCENT', 'PIVOT', 'PLACING', 'POSITION', 'PRECEDING', 'PRIMARY', 'PRINCIPALS', 'PROPERTIES', 'PROVIDER', 'PROVIDERS', 'PURGE', 'QUALIFY', 'QUERY', 'RANGE', 'RECIPIENT', 'RECIPIENTS', 'RECORDREADER', 'RECORDWRITER', 'RECOVER', 'REDUCE', 'REFERENCES', 'REFRESH', 'REMOVE', 'RENAME', 'REPAIR', 'REPEATABLE', 'REPLACE', 'REPLICAS', 'RESET', 'RESPECT', 'RESTRICT', 'REVOKE', 'RIGHT', RLIKE, 'ROLE', 'ROLES', 'ROLLBACK', 'ROLLUP', 'ROW', 'ROWS', 'SECOND', 'SCHEMA', 'SCHEMAS', 'SELECT', 'SEMI', 'SEPARATED', 'SERDE', 'SERDEPROPERTIES', 'SESSION_USER', 'SET', 'MINUS', 'SETS', 'SHARE', 'SHARES', 'SHOW', 'SKEWED', 'SOME', 'SORT', 'SORTED', 'START', 'STATISTICS', 'STORED', 'STRATIFY', 'STRUCT', 'SUBSTR', 'SUBSTRING', 'SYNC', 'SYSTEM_TIME', 'SYSTEM_VERSION', 'TABLE', 'TABLES', 'TABLESAMPLE', 'TBLPROPERTIES', TEMPORARY, 'TERMINATED', 'THEN', 'TIME', 'TIMESTAMP', 'TIMESTAMPADD', 'TIMESTAMPDIFF', 'TO', 'TOUCH', 'TRAILING', 'TRANSACTION', 'TRANSACTIONS', 'TRANSFORM', 'TRIM', 'TRUE', 'TRUNCATE', 'TRY_CAST', 'TYPE', 'UNARCHIVE', 'UNBOUNDED', 'UNCACHE', 'UNION', 'UNIQUE', 'UNKNOWN', 'UNLOCK', 'UNSET', 'UPDATE', 'USE', 'USER', 'USING', 'VALUES', 'VERSION', 
'VIEW', 'VIEWS', 'WHEN', 'WHERE', 'WINDOW', 'WITH', 'WITHIN', 'YEAR', 'ZONE', IDENTIFIER, BACKQUOTED_IDENTIFIER}(line 1, pos 7)
== SQL ==
Country
-------^^^
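The ParseException comes from the schema argument: when schema is a string, Spark parses it as a DDL definition, which expects "column_name data_type" pairs, so a bare "Country" fails at <EOF>. A minimal sketch of a likely fix (the data types here are assumptions based on the sample rows):
# give each column a type in the DDL string
df1 = spark.createDataFrame(pandas_df.iloc[[2], :2], schema="Player string, Titles long")
df2 = spark.createDataFrame(pandas_df.iloc[[2], 2:], schema="Country string")

# or omit the schema entirely and let Spark infer it from the pandas slice
df2 = spark.createDataFrame(pandas_df.iloc[[2], 2:])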

Pandas extract value from a key-value pair

I have a DataFrame with output as shown below, and I am trying to extract specific text:
id,value
101,*sample value as shown below*
I am trying to extract the values corresponding to the key and id entries in this text.
Expected output
id, key, id_new
101,Ticket-123, 1001
Given below is what the data looks like:
{
'fields': {
'status': {
'statusCategory': {
'colorName': 'yellow',
'name': 'In Progress',
'key': 'indeterminate',
'id': 4
},
'description': '',
'id': '11000',
'name': 'In Progress'
},
'summary': 'Sample Text'
},
'key': 'Ticket-123',
'id': '1001'
}
Use Series.str.get:
df['key'] = df['value'].str.get('key')
df['id_new'] = df['value'].str.get('id')
print (df)
id value key id_new
0 101 {'fields': {'status': {'statusCategory': {'col... Ticket-123 1001
Tested Dataframe:
v = {
'fields': {
'status': {
'statusCategory': {
'colorName': 'yellow',
'name': 'In Progress',
'key': 'indeterminate',
'id': 4
},
'description': '',
'id': '11000',
'name': 'In Progress'
},
'summary': 'Sample Text'
},
'key': 'Ticket-123',
'id': '1001'
}
df = pd.DataFrame({'id':101, 'value':[v]})
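One hedged caveat: Series.str.get works here because the value column holds actual dict objects. If the column had been read in as plain strings (e.g. from a CSV export), .str.get would return NaN; in that case convert the strings first, for example:
import ast
# only needed when 'value' contains string representations of dicts
df['value'] = df['value'].apply(ast.literal_eval)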

Pandas - Extract value from Dataframe based on certain key value not in a sequence

I have a Dataframe in the below format:
id, ref
101, [{'id': '74947', 'type': {'id': '104', 'name': 'Sales', 'inward': 'Sales', 'outward': 'PO'}, 'inwardIssue': {'id': '76560', 'key': 'Prod-A'}}]
102, [{'id': '74948', 'type': {'id': '105', 'name': 'Return', 'inward': 'Return Order', 'outward': 'PO'}, 'inwardIssue': {'id': '76560', 'key': 'Prod-C'}},
{'id': '750001', 'type': {'id': '342', 'name': 'Sales', 'inward': 'Sales', 'outward': 'PO'}, 'inwardIssue': {'id': '76560', 'key': 'Prod-X'}}]
103, [{'id': '74949', 'type': {'id': '106', 'name': 'Sales', 'inward': 'Return Order', 'outward': 'PO'}, 'inwardIssue': {'id': '76560', 'key': 'Prod-B'}}]
104, [{'id': '67543', 'type': {'id': '106', 'name': 'Other', 'inward': 'Return Order', 'outward': 'PO'}, 'inwardIssue': {'id': '76560', 'key': 'Prod-BA'}}]
I am trying to extract rows that have name = Sales and return the below output:
101, Prod-A
102, Prod-X
103, Prod-B
I am able to extract the required data when the matching key-value pair appears in the first element of the list, but not when it appears later, as in the case of id = 102:
df['names'] = df['ref'].str[0].str.get('type').str.get('name')
df['value'] = df['ref'].str[0].str.get('inwardIssue').str.get('key')
df['output'] = np.where(df['names'] == 'Sales', df['value'], 0)
Currently I am only able to get values for id = 101 and 103.
Let us do explode, keeping the original index so the matching rows align back to df:
s = df['ref'].explode()
s = pd.DataFrame(s.tolist(), index=s.index)
s = s.loc[s['type'].str.get('name').eq('Sales'), 'inwardIssue'].str.get('key')
dfs = df.join(s, how='right')
    id                                                ref inwardIssue
0  101  [{'id': '74947', 'type': {'id': '104', 'name':...      Prod-A
1  102  [{'id': '74948', 'type': {'id': '105', 'name':...      Prod-X
2  103  [{'id': '74949', 'type': {'id': '106', 'name':...      Prod-B
If you already have a dataframe in that format, you may convert it to records and use pd.json_normalize to turn the original df into a flat dataframe, then slice/filter on that flat dataframe.
df1 = pd.json_normalize(df.to_dict(orient='records'), 'ref')
The output of this flat dataframe df1
Out[83]:
id type.id type.name type.inward type.outward inwardIssue.id \
0 74947 104 Sales Sales PO 76560
1 74948 105 Return Return Order PO 76560
2 750001 342 Sales Sales PO 76560
3 74949 106 Sales Return Order PO 76560
4 67543 106 Other Return Order PO 76560
inwardIssue.key
0 Prod-A
1 Prod-C
2 Prod-X
3 Prod-B
4 Prod-BA
Finally, slicing on df1
df_final = df1.loc[df1['type.name'].eq('Sales'), ['type.id', 'inwardIssue.key']]
Out[88]:
type.id inwardIssue.key
0 104 Prod-A
2 342 Prod-X
3 106 Prod-B
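Note that using 'ref' as the record path drops the outer id, so the id shown in df1 is the inner one. If the original id is needed next to the key, one possible variant (the 'ref.' prefix is just an illustrative choice) is to keep it via meta and prefix the record columns so the two id fields do not clash:
flat = pd.json_normalize(df.to_dict(orient='records'), 'ref',
                         meta=['id'], record_prefix='ref.')
out = flat.loc[flat['ref.type.name'].eq('Sales'), ['id', 'ref.inwardIssue.key']]
print(out)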

Pandas - Extract value from Dataframe based on certain key value

I have a Dataframe in the below format:
id, ref
101, [{'id': '74947', 'type': {'id': '104', 'name': 'Sales', 'inward': 'Sales', 'outward': 'PO'}, 'inwardIssue': {'id': '76560', 'key': 'Prod-A'}}]
102, [{'id': '74948', 'type': {'id': '105', 'name': 'Return', 'inward': 'Return Order', 'outward': 'PO'}, 'inwardIssue': {'id': '76560', 'key': 'Prod-C'}}]
103, [{'id': '74949', 'type': {'id': '106', 'name': 'Sales', 'inward': 'Return Order', 'outward': 'PO'}, 'inwardIssue': {'id': '76560', 'key': 'Prod-B'}}]
I am trying to extract rows that have name = Sales and return back the below output:
id, value
101, Prod-A
103, Prod-B
Use str[0] to select the first element of each list, then Series.str.get to extract values by the keys of the dicts:
#if necessary convert list/dict repr to list/dict
import ast
df['ref'] = df['ref'].apply(ast.literal_eval)
df['names'] = df['ref'].str[0].str.get('type').str.get('name')
df['value'] = df['ref'].str[0].str.get('inwardIssue').str.get('key')
print (df)
id ref names value
0 101 [{'id': '74947', 'type': {'id': '104', 'name':... Sales Prod-A
1 102 [{'id': '74948', 'type': {'id': '105', 'name':... Return Prod-C
2 103 [{'id': '74949', 'type': {'id': '106', 'name':... Sales Prod-B
And then filter by boolean indexing:
df1 = df.loc[df['names'].eq('Sales'), ['id','value']]
print (df1)
id value
0 101 Prod-A
2 103 Prod-B

pandas same attribute comparison

I have the following dataframe:
df = pd.DataFrame([{'name': 'a', 'label': 'false', 'score': 10},
{'name': 'a', 'label': 'true', 'score': 8},
{'name': 'c', 'label': 'false', 'score': 10},
{'name': 'c', 'label': 'true', 'score': 4},
{'name': 'd', 'label': 'false', 'score': 10},
{'name': 'd', 'label': 'true', 'score': 6},
])
I want to return names where the score of the "false" label is at least double the score of the "true" label. In my example, it should return only the name "c".
First you can pivot the data, then look at the ratio and filter what you want:
new_df = df.pivot(index='name',columns='label', values='score')
new_df[new_df['false'].div(new_df['true']).gt(2)]
output:
label false true
name
c 10 4
If you only want the names, you can do:
new_df.index[new_df['false'].div(new_df['true']).gt(2)].values
which gives
array(['c'], dtype=object)
Update: since your data is the result of orig_df.groupby().count(), you could instead do:
orig_df['label'].eq('true').groupby(orig_df['name']).mean()
and look at the rows with values <= 1/3.
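A small sketch of that idea with a hypothetical orig_df (one row per observation, before the groupby/count), just to show the 1/3 threshold in action:
import pandas as pd

# hypothetical raw data: one row per observation, before grouping and counting
orig_df = pd.DataFrame({
    'name':  ['a'] * 18 + ['c'] * 14 + ['d'] * 16,
    'label': ['false'] * 10 + ['true'] * 8
           + ['false'] * 10 + ['true'] * 4
           + ['false'] * 10 + ['true'] * 6,
})

# fraction of 'true' rows per name; "false at least double true" means fraction <= 1/3
frac_true = orig_df['label'].eq('true').groupby(orig_df['name']).mean()
print(frac_true[frac_true.le(1 / 3)].index.tolist())   # ['c']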