pandas multi-index from JSON document

How do I convert this multi-column index json document to pandas dataframe?
mylist = [
    {
        "col1": "val1",
        "col2": [
            {"col21": "val21"},
            {"col21": "val22"}
        ]
    }
]
What I have tried:
# pip install json-excel-converter[extra]
from json_excel_converter import Converter
from json_excel_converter.xlsx import Writer
import pandas as pd
conv = Converter()
conv.convert(mylist, Writer(file="test.xlsx"))
df = pd.read_excel("test.xlsx", header=list(range(2)))
This produces the expected result:
                 col1   col2
   Unnamed: 0_level_1  col21 col21.1
0                val1  val21   val22
But I have 2 questions:
Is there any way to avoid creating the Excel file and write directly to a dataframe?
Is there any other way to achieve the same result?

Take a look at the structure of the data when it comes back from json_excel_converter (modern pandas spells this orient "records"):
df.to_dict(orient="records")
[{('col1', 'Unnamed: 0_level_1'): 'val1',
  ('col2', 'col21'): 'val21',
  ('col2', 'col21.1'): 'val22'}]
So we need to produce that format from your input JSON. I've done this in two ways: old-school looping, and a list comprehension. You also need to tell pandas to create a MultiIndex from the tuples in the columns:
newlist = []
for d in mylist:
    myd = {}
    for k in list(d.keys()):
        if not isinstance(d[k], list):
            myd = {**myd, (k, ""): d[k]}
        else:
            for x, i in enumerate(d[k]):
                myd = {**myd, (k, f"{list(i)[0]}.{x}"): list(i.values())[0]}
    newlist.append(myd)
newlist = [
    {
        **{(k, ""): d[k]
           for k in list(d.keys()) if not isinstance(d[k], list)},
        **{(k, f"{kk}.{i}"): l[kk]
           for k in list(d.keys()) if isinstance(d[k], list)
           for i, l in enumerate(d[k])
           for kk in list(l.keys())},
    }
    for d in mylist
]
df = pd.DataFrame(newlist)
df.columns = pd.MultiIndex.from_tuples(df.columns)
df
output
   col1    col2
        col21.0 col21.1
0  val1   val21   val22
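As for the "other way" part of the question: if a flat (long) layout is acceptable, pd.json_normalize avoids both the Excel round-trip and the manual tuple building, although it does not produce the two-level column index on its own. A minimal sketch:
# Flatten the nested list into long form, one row per inner dict
flat = pd.json_normalize(mylist, record_path="col2", meta=["col1"])
print(flat)
#    col21  col1
# 0  val21  val1
# 1  val22  val1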

Related

New column with word at nth position of string from other column pandas

import numpy as np
import pandas as pd

d = {'ABSTRACT_ID': [14145090, 1900667, 8157202, 6784974],
     'TEXT': [
         "velvet antlers vas are commonly used in tradit",
         "we have taken a basic biologic RPA to elucidat4",
         "ceftobiprole bpr is an investigational cephalo",
         "lipoperoxidationderived aldehydes for example"],
     'LOCATION': [1, 4, 2, 1]}
df = pd.DataFrame(data=d)
df
def word_at_pos(x, y):
    pos = x
    string = y
    count = 0
    res = ""
    for word in string:
        if word == ' ':
            count = count + 1
            if count == pos:
                break
            res = ""
        else:
            res = res + word
    print(res)

word_at_pos(df.iloc[0, 2], df.iloc[0, 1])
For this df I want to create a new column WORD that contains the word from TEXT at the position indicated by LOCATION, e.g. the first line would be "velvet".
I can do this for a single line with the isolated function word_at_pos(x, y), but can't work out how to apply it to the whole column. I have created new columns with lambda functions before, but can't work out how to fit this function into a lambda.
Looping over TEXT and LOCATION could be the best idea because splitting creates a jagged array, so filtering using numpy advanced indexing won't be possible.
df["WORDS"] = [txt.split()[loc] for txt, loc in zip(df["TEXT"], df["LOCATION"]-1)]
print(df)
   ABSTRACT_ID  ...                    WORDS
0     14145090  ...                   velvet
1      1900667  ...                        a
2      8157202  ...                      bpr
3      6784974  ...  lipoperoxidationderived

[4 rows x 4 columns]
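If you prefer the lambda style mentioned in the question, the same logic fits DataFrame.apply row by row (equivalent result, usually a bit slower than the list comprehension):
df["WORDS"] = df.apply(lambda row: row["TEXT"].split()[row["LOCATION"] - 1], axis=1)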

How to create variable PySpark Dataframes by Dropping Null columns

I have 2 JSON files in a relative folder named 'source_data':
"source_data/data1.json"
{
    "name": "John Doe",
    "age": 32,
    "address": "ZYZ - Heaven"
}
"source_data/data2.json"
{
    "userName": "jdoe",
    "password": "password",
    "salary": "123456789"
}
Using the following PySpark code I have created DataFrame:
from pyspark.sql import SparkSession

spark = SparkSession \
    .builder \
    .appName("Python Spark SQL basic example") \
    .config("spark.some.config.option", "some-value") \
    .getOrCreate()

df = spark.read.json("source_data")
print(df.head())
Output:
df.head(10)
[Row(name='John Doe', age=32, address='ZYZ - Heaven', userName=None, password=None, salary=None),
Row(name=None, age=None, address=None, userName='jdoe', password='password', salary='123456789')]
Now I want to create a variable number of DataFrames by dropping the None-valued columns, like this:
df1.head()
[Row(name='John Doe', age=32, address='ZYZ - Heaven')]
and,
df2.head()
[Row(userName='jdoe', password='password', salary='123456789')]
I am only finding solutions for dropping entire rows based on all or any column(s).
Is there any way to achieve what I am looking for?
TIA
You can just select the columns you require into separate dataframes and filter each one on the condition that at least one of those columns is not null (shown here in Scala; a PySpark sketch follows below).
// source data
val df = spark.read.json("path")

// select and filter
val df1 = df.select("address", "age", "name")
  .filter($"address".isNotNull || $"age".isNotNull || $"name".isNotNull)
val df2 = df.select("password", "salary", "userName")
  .filter($"password".isNotNull || $"salary".isNotNull || $"userName".isNotNull)

// see the output as a dataframe or using head, as you prefer
println(df1.head)
df2.head
Output of the head command for both DataFrames, df1 and df2 (shown as screenshots in the original post).
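Since the question uses PySpark, here is the same idea as a PySpark sketch (assuming the column names above):
from pyspark.sql.functions import col

df1 = df.select("address", "age", "name") \
    .filter(col("address").isNotNull() | col("age").isNotNull() | col("name").isNotNull())
df2 = df.select("password", "salary", "userName") \
    .filter(col("password").isNotNull() | col("salary").isNotNull() | col("userName").isNotNull())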

How to sort a multiindex dataframe by column value and maintain the multiindex structure?

I have a DataFrame with a MultiIndex (TestName and TestResult.Outcome) and want to sort it descending by a column value while maintaining the visual MultiIndex pairs (TestName and TestResult.Outcome). How can I achieve that?
For example, I want to sort the following table descending by the column "n * %" of the rows whose TestResult.Outcome index value is "Failed":
I want to achieve the following outcome, maintaining the Passed/Failed pairs in the indices:
I tried this:
orderedByTotalNxPercentDesc = myDf.sort_values(['TestResult.Outcome', 'n * %'], ascending=False)
but this sorts primarily on the index value "Passed" and breaks up the Passed/Failed index pairs.
This can help you:
import pandas as pd
import numpy as np

arrays = [np.array(["bar", "bar", "baz", "baz", "foo", "foo", "qux", "qux"]),
          np.array(["one", "two", "one", "two", "one", "two", "one", "two"])]
df = pd.DataFrame(np.random.randn(8, 4), index=arrays)
(df.reset_index()
   .groupby(["level_0"])
   .apply(lambda x: x.sort_values([3], ascending=False))
   .set_index(['level_0', 'level_1']))
In your case 3 is your column n * %, level_0 is your index TestName and level_1 is your TestResult.Outcome.
Becomes: (result shown as an image in the original post)
I was able to get what I want by creating a dummy column for sorting:
iterables = [["bar", "baz", "foo", "qux"], ["one", "two"]]
df = pd.DataFrame(np.random.randn(8, 1),
                  index=pd.MultiIndex.from_product(iterables, names=['level_0', 'level_1']))
df = df.rename(columns={0: "myvalue"}, errors='raise')
# Give every row in a group the group's "two" value, then sort on that dummy
for index, row in df.iterrows():
    df.loc[index, 'sort_dummy'] = df.loc[(index[0], 'two'), 'myvalue']
df = df.sort_values(['sort_dummy'], ascending=False)
df
Output: (shown as an image in the original post)
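Alternatively, the dummy column can be avoided for the original TestName / TestResult.Outcome frame: rank the TestName groups by their "Failed" row and reindex, which keeps each Passed/Failed pair together (a sketch, assuming every TestName has a "Failed" row):
order = (myDf.xs('Failed', level='TestResult.Outcome')['n * %']
             .sort_values(ascending=False)
             .index)
orderedByTotalNxPercentDesc = myDf.loc[order]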

Collapsing a pandas dataframe into a single column of all items and their occurrences

I have a data frame consisting of a mixture of NaNs and strings, e.g.
data = {'String1': ['NaN', 'tree', 'car', 'tree'],
        'String2': ['cat', 'dog', 'car', 'tree'],
        'String3': ['fish', 'tree', 'NaN', 'tree']}
ddf = pd.DataFrame(data)
I want to:
1: count the total number of items and put the counts in a new data frame, e.g.
NaN=2
tree=5
car=2
fish=1
cat=1
dog=1
2: count the total number of items when compared to a separate, longer list (a column of another data frame), e.g.
df['compare'] =
NaN
tree
car
fish
cat
dog
rabbit
Pear
Orange
snow
rain
Thanks
Jason
For the first question (with the mapping for the second):
from collections import Counter
import pandas as pd

data = {
    "String1": ["NaN", "tree", "car", "tree"],
    "String2": ["cat", "dog", "car", "tree"],
    "String3": ["fish", "tree", "NaN", "tree"],
}
ddf = pd.DataFrame(data)

# Count every item across all columns
a = Counter(ddf.stack().tolist())
df_result = pd.DataFrame(dict(a), index=['Count']).T

# Question 2: map the counts onto the longer comparison list
df = pd.DataFrame({'vals': ['NaN', 'tree', 'car', 'fish', 'cat', 'dog',
                            'rabbit', 'Pear', 'Orange', 'snow', 'rain']})
df_counts = df.vals.map(df_result.to_dict()['Count'])
This should do :)
You can use the following code to count items over the whole data frame.
import pandas as pd

data = {'String1': ['NaN', 'tree', 'car', 'tree'],
        'String2': ['cat', 'dog', 'car', 'tree'],
        'String3': ['fish', 'tree', 'NaN', 'tree']}
df = pd.DataFrame(data)

def get_counts(df: pd.DataFrame) -> dict:
    res = {}
    for col in df.columns:
        vc = df[col].value_counts().to_dict()
        for k, v in vc.items():
            if k in res:
                res[k] += v
            else:
                res[k] = v
    return res

counts = get_counts(df)
Output
>>> print(counts)
{'tree': 5, 'car': 2, 'NaN': 2, 'cat': 1, 'dog': 1, 'fish': 1}
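For what it's worth, pandas can produce the same tally in one line: stack flattens the frame into a single Series and value_counts tallies it (the 'NaN' entries here are literal strings, so they are counted like any other value):
>>> print(df.stack().value_counts().to_dict())
{'tree': 5, 'car': 2, 'NaN': 2, 'cat': 1, 'dog': 1, 'fish': 1}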

Append values to pandas dataframe

I have a constant value and a list of lists in my result. I need to add the constant and its corresponding list of lists as a row in a pandas dataframe.
The dataframe would have 2 columns, Col1 and Col2. I generate these values inside a for loop.
Code used to generate the values:
for key, elem in dict.items():
    print key
    length = len(elem)
    elements = list(elem)
    values = []
    firsthalf = elements[:len(elemlist)/2]
    print firsthalf
Values generated:
[[0.040456528559673702, -0.085805111083485666]]
11
-----
[[0.035220881869308676, -0.063623927372217309, 0.0063355856789509323]]
12
Dataframe:
Col1 Col2
[[0.040456528559673702, -0.085805111083485666]] 11
[[0.035220881869308676, -0.063623927372217309, 0.0063355856789509323]] 12
Any help would be appreciated. Thanks !!
It's easiest to append your objects to lists, then use those to initialize:
import pandas as pd

col1 = []
col2 = []
for key, elem in dict.items():  # note: 'dict' shadows the builtin, as in the question
    length = len(elem)
    elements = list(elem)
    values = []
    firsthalf = elements[:len(elemlist)/2]  # elemlist? (undefined in the question)
    col1.append(key)
    col2.append(firsthalf)

df = pd.DataFrame({'col1': col1, 'col2': col2})
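As an aside, collecting plain lists and constructing the DataFrame once at the end is much faster than calling df.append inside the loop, since each append copies the whole frame. And if elemlist was indeed meant to be elements, the loop collapses to a comprehension (a sketch under that assumption):
df = pd.DataFrame(
    [(key, list(elem)[:len(elem) // 2])  # assumes 'elemlist' meant 'elements'
     for key, elem in dict.items()],
    columns=['col1', 'col2'])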