Nested dictionary to pandas df

My first question on Stack Overflow!
I have a triply nested dictionary and I want to convert it to a pandas DataFrame.
The dictionary has the following structure:
dictionary = {'CompanyA': {'Revenue': {date1: $1, date2: $2},
                           'ProfitLoss': {date1: $0, date2: $1}},
              'CompanyB': {'Revenue': {date1: $1, date2: $2},
                           'ProfitLoss': {date1: $0, date2: $1}},
              'CompanyC': {'Revenue': {date1: $1, date2: $2},
                           'ProfitLoss': {date1: $0, date2: $1}}}
So far I have been able to construct a df using:
df = pd.DataFrame.from_dict(dictionary)
But the result is a df with dictionaries as values, like this:
            CompanyA       CompanyB       CompanyC
Revenue     {date1:$1,..}  {date1:$1,..}  {date1:$1,..}
ProfitLoss  {date1:$0,..}  {date1:$0,..}  {date1:$0,..}
I want the table to look like this:
                  CompanyA  CompanyB  CompanyC
Revenue    Date1  $1        $1        $1
           Date2  $2        $2        $2
ProfitLoss Date1  $0        $0        $0
           Date2  $1        $1        $1
I have tried using pd.MultiIndex (.from_product and friends) and changing the index, with no result. Any idea what to do next? Any hint will be appreciated!

I see you're new, but there may already be an answer to a similar question; see this. Next time, try searching for a similar question using keywords. For example, I found the one I linked by searching "pandas nested dict", and the first result was that SO post!
Anyway, you need to reshape your input dict. You want a dict structured like this:
{
    'CompanyA': {
        ('Revenue', 'date1'): 1,
        ('ProfitLoss', 'date1'): 0,
        ...
    },
    ...
}
I would do something like this:
import pandas as pd

data = {
    'CompanyA': {
        'Revenue': {
            "date1": 1,
            "date2": 2
        },
        'ProfitLoss': {
            "date1": 0,
            "date2": 1
        }
    },
    'CompanyB': {
        'Revenue': {
            "date1": 4,
            "date2": 5
        },
        'ProfitLoss': {
            "date1": 2,
            "date2": 3
        }
    }
}

# Reshape your data and pass it to `DataFrame.from_dict`
df = pd.DataFrame.from_dict({i: {(j, k): data[i][j][k]
                                 for j in data[i] for k in data[i][j]}
                             for i in data}, orient="columns")
print(df)
print(df)
Output:
                  CompanyA  CompanyB
ProfitLoss date1         0         2
           date2         1         3
Revenue    date1         1         4
           date2         2         5
EDIT
Using actual datetimes to respond to your comment:
import pandas as pd
import datetime as dt

date1 = dt.datetime.now()
date2 = date1 + dt.timedelta(days=365)

data = {
    'CompanyA': {
        'Revenue': {
            date1: 1,
            date2: 2
        },
        'ProfitLoss': {
            date1: 0,
            date2: 1
        }
    },
    'CompanyB': {
        'Revenue': {
            date1: 4,
            date2: 5
        },
        'ProfitLoss': {
            date1: 2,
            date2: 3
        }
    }
}

# Reshape your data and pass it to `DataFrame.from_dict`
df = pd.DataFrame.from_dict({i: {(j, k): data[i][j][k]
                                 for j in data[i] for k in data[i][j]}
                             for i in data}, orient="columns")
print(df)
Output:
                                       CompanyA  CompanyB
ProfitLoss 2018-10-08 11:19:09.006375         0         2
           2019-10-08 11:19:09.006375         1         3
Revenue    2018-10-08 11:19:09.006375         1         4
           2019-10-08 11:19:09.006375         2         5
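For what it's worth, the same reshaping can also be done without the explicit dict comprehension by building one small frame per company and concatenating them. A minimal sketch, assuming the same data dict as above:
import pandas as pd

# `pd.DataFrame(v)` puts dates on the index and metrics on the columns;
# `.unstack()` turns that frame into a Series keyed by (metric, date),
# and `pd.concat` with a dict of those Series makes companies the columns.
df = pd.concat({company: pd.DataFrame(v).unstack()
                for company, v in data.items()}, axis=1)
print(df)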

Related

Pandas | Filter DF rows with Integers that lie between two Integer values in another Dataframe

I have two DataFrames. The goal is to filter out rows in DF1 that have an Integer value that lies between any of the Integers in the ["Begin"] and ["End"] columns in any of the 37 rows of DF2.
DF1:
INDEX  String    IntValues
1      "string"  808091
2      "string"  1168262
3      "string"  1169294
...    ...       ...
647    "string"  14193661
648    "string"  14551918
DF2:
Index  Begin       End
1      1196482.2   1216529
2      1791819.7   1834887
3      2008405.1   2014344
...    ...         ...
36     14168540.0  14193933
37     14727507.1  14779605
I think it is possible to use something like:
df1[(df1["IntValues"] >= 1196482.2) & (df1["IntValues"] <= 1216529), (... 36 more conditions)]
Is there a better way than just writing down these 37 conditions, like a variable for the begin and end values of that "filter window"?
Edit: As requested, a code sample. It is not from the original DF, but I hope it suffices.
import pandas as pd

d1 = {
    "string": ["String0", "String1", "String2", "String3", "String4", "String5", "String6", "String7",
               "String8", "String9", "String10", "String11", "String12", "String13", "String14"],
    "timestamp": [1168262, 1169294, 1184451, 1210449, 1210543, 1210607, 1644328,
                  1665732, 1694388, 1817309, 1822872, 1825310, 2093796, 2182923, 2209252],
    "should be in": ["Yes", "Yes", "Yes", "No", "No", "No", "yes", "yes", "yes", "no", "no", "no", "yes", "yes", "no"]
}
df1 = pd.DataFrame(d1)

d2 = {
    'begin': [1196482.2, 1791819.7, 2199564.6],
    'end': [1216529, 1834887, 2212352]
}
df2 = pd.DataFrame(d2)
Try this:
df_final = []
for i, j in zip(df2["Begin"], df2["End"]):
    x = df1[(df1["IntValues"] >= i) & (df1["IntValues"] <= j)]
    df_final.append(x)
df_final = pd.concat(df_final, axis=0).reset_index(drop=True)
df_final = df_final.drop_duplicates()
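If you'd rather avoid the Python loop entirely, a vectorized alternative is to compare every value against every window at once with NumPy broadcasting. A sketch, assuming the IntValues/Begin/End column names from the question:
import numpy as np

vals = df1["IntValues"].to_numpy()
# mask[i] is True if vals[i] falls inside at least one [Begin, End] window
mask = ((vals[:, None] >= df2["Begin"].to_numpy()) &
        (vals[:, None] <= df2["End"].to_numpy())).any(axis=1)
df_final = df1[mask].reset_index(drop=True)
# use df1[~mask] instead if you want to drop the in-window rows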

In apache spark SQL, how to remove the duplicate rows when using collect_list in window function?

I have the below dataframe:
+----+-----+----+--------+
|year|month|item|quantity|
+----+-----+----+--------+
|2019|1 |TV |8 |
|2019|2 |AC |10 |
|2018|1 |TV |2 |
|2018|2 |AC |3 |
+----+-----+----+--------+
Using a window function, I want to get the output below:
val partitionWindow = Window.partitionBy("year").orderBy("month")
val itemsList = collect_list(struct("item", "quantity")).over(partitionWindow)
df.select($"year", itemsList as "items")
Expected output:
+----+-------------------+
|year|items |
+----+-------------------+
|2019|[[TV, 8], [AC, 10]]|
|2018|[[TV, 2], [AC, 3]] |
+----+-------------------+
But when I use the window function, there is a cumulative duplicate row per item:
Current output:
+----+-------------------+
|year|items |
+----+-------------------+
|2019|[[TV, 8]] |
|2019|[[TV, 8], [AC, 10]]|
|2018|[[TV, 2]] |
|2018|[[TV, 2], [AC, 3]] |
+----+-------------------+
What is the best way to remove the duplicate rows?
I believe the interesting part here is that the aggregated list of items has to be sorted by month, so I've written code for three approaches.
Creating a sample dataset:
import org.apache.spark.sql._
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

case class data(year: Int, month: Int, item: String, quantity: Int)

val spark = SparkSession.builder().master("local").getOrCreate()
import spark.implicits._

val inputDF = spark.createDataset(Seq(
  data(2018, 2, "AC", 3),
  data(2019, 2, "AC", 10),
  data(2019, 1, "TV", 2),
  data(2018, 1, "TV", 2)
)).toDF()
Approach 1: Aggregating month, item and quantity into a list and then sorting the items by month using a UDF:
case class items(item: String, quantity: Int)

def getItemsSortedByMonth(itemsRows: Seq[Row]): Seq[items] = {
  if (itemsRows == null || itemsRows.isEmpty) {
    null
  } else {
    itemsRows.sortBy(r => r.getAs[Int]("month"))
      .map(r => items(r.getAs[String]("item"), r.getAs[Int]("quantity")))
  }
}

val itemsSortedByMonthUDF = udf(getItemsSortedByMonth(_: Seq[Row]))

val outputDF = inputDF.groupBy(col("year"))
  .agg(collect_list(struct("month", "item", "quantity")).as("items"))
  .withColumn("items", itemsSortedByMonthUDF(col("items")))
Approach 2: Using window functions:
val monthWindowSpec = Window.partitionBy("year").orderBy("month")
val rowNumberWindowSpec = Window.partitionBy("year").orderBy("row_number")
val runningList = collect_list(struct("item", "quantity")).over(rowNumberWindowSpec)

val tempDF = inputDF
  // using row_number for continuous ranks if there are multiple items in the same month
  .withColumn("row_number", row_number().over(monthWindowSpec))
  .withColumn("items", runningList)
  .drop("month", "item", "quantity")
tempDF.persist()

val yearToSelect = tempDF.groupBy("year").agg(max("row_number").as("row_number"))
val outputDF = tempDF.join(yearToSelect, Seq("year", "row_number")).drop("row_number")
Edit:
Added a third approach for posterity, using the Dataset APIs groupByKey and mapGroups:
// encoding to the data case class can be avoided if inputDF is not converted to a dataset of Row objects
val outputDF = inputDF.as[data].groupByKey(_.year).mapGroups { case (year, rows) =>
  val itemsSortedByMonth = rows.toSeq.sortBy(_.month).map(s => items(s.item, s.quantity))
  (year, itemsSortedByMonth)
}.toDF("year", "items")
Initially I was looking for an approach without a UDF. That was OK except for one aspect that I could not solve elegantly. With a simple map UDF it is extremely simple, simpler than the other answers. So, for posterity, and a little late due to other commitments:
Try this...
import spark.implicits._
import org.apache.spark.sql.Row
import org.apache.spark.sql.functions._

case class abc(year: Int, month: Int, item: String, quantity: Int)

val itemsList = collect_list(struct("month", "item", "quantity"))

val my_udf = udf { items: Seq[Row] =>
  val res = items.map { r => (r.getAs[String](1), r.getAs[Int](2)) }
  res
}

// Gen some data; however, not the thrust of the problem.
val df0 = Seq(abc(2019, 1, "TV", 8), abc(2019, 7, "AC", 10), abc(2018, 1, "TV", 2), abc(2018, 2, "AC", 3), abc(2019, 2, "CO", 7)).toDS()
val df1 = df0.toDF()

val df2 = df1.groupBy($"year")
  .agg(itemsList as "items")
  .withColumn("sortedCol", sort_array($"items", asc = true))
  .withColumn("sortedItems", my_udf(col("sortedCol")))
  .drop("items").drop("sortedCol")
  .orderBy($"year".desc)

df2.show(false)
df2.printSchema()
Noting the following, which you should fix:
doing the order by later is better, imho
mistakes in the data (fixed now)
ordering the month as a String is not a good idea; you need to convert it to a month number
Returns:
+----+----------------------------+
|year|sortedItems |
+----+----------------------------+
|2019|[[TV, 8], [CO, 7], [AC, 10]]|
|2018|[[TV, 2], [AC, 3]] |
+----+----------------------------+

datatables order by number doc and year of doc

I don't know how to implement a simple way to sort a column that shows the document number with a trailing year, like:
COL 1 COL 2 COL 3 DOC. NR
------------------------------------
x x x 2/2020
------------------------------------
x x x 3/2020
------------------------------------
x x x 4/2021
------------------------------------
x x x 1/2022
------------------------------------
In my example I'd like to sort the doc. nr column asc or desc, based on the document number (grouped by year).
I tried to set data-order with the formula numberdoc + year, but that doesn't work because, for example, 3 + 2020 equals 1 + 2022... so this way is not correct. Any idea?
<td data-order="2020">1/2019</td>
Script:
$(document).ready(function () {
  var table = $('#example').DataTable({
    "aaSorting": [[1, "desc"]],
    "columnDefs": [
      {
        targets: [1],
        data: {
          _: "1.display",
          sort: "1.#data-order",
          type: "1.#data-order"
        }
      }
    ]
  });
});
My fiddle: http://live.datatables.net/jivekefo/1/edit
Expected (ASC) result:
1/2018
2/2018
3/2018
1/2019
2/2019
3/2019
4/2019
1/2020
...
OK, solved with this rule:
<td data-order="<?php echo (new Datetime($dateDoc))->format("Y") . str_pad(ltrim($numberDoc, "0"), 5, "0", STR_PAD_LEFT); /* example. 201900001 */ ?>">
http://live.datatables.net/jivekefo/2/edit

How to select pandas row(s) whose attributes column contains any one of the values from a list

import pandas as pd

data = {
    "name": ["abc", "xyz", "pqr"],
    "attributes": [["attr2", "attr3"], ["attr2", "attr4"], ["attr3", "attr1"]]
}
df = pd.DataFrame.from_dict(data)
How do I filter rows which satisfy this condition:
select a row if its attributes column contains any of "attr1" or "attr3"
The expected output is:
  name   attributes
0 "abc"  ["attr2", "attr3"]
1 "pqr"  ["attr3", "attr1"]
Using:
df[pd.DataFrame(df.attributes.tolist()).isin(['attr1', 'attr3']).any(axis=1)]
Out[295]:
       attributes  name
0  [attr2, attr3]   abc
2  [attr3, attr1]   pqr
To get a boolean indexer,
>>> idx = df['attributes'].map(lambda l: any(s in l for s in ['attr1', 'attr3']))
>>> idx
0 True
1 False
2 True
Name: attributes, dtype: bool
Then
>>> df.loc[idx]
name attributes
0 abc [attr2, attr3]
2 pqr [attr3, attr1]
Whether you want to reset the index afterward is up to you.
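If you prefer to stay with built-in pandas operations end to end, an explode-based variant (a sketch, assuming pandas 0.25+ for DataFrame.explode) does the same membership test:
# `explode` repeats each row once per list element and keeps the original
# index, so groupby(level=0) reduces the per-element test back to one
# boolean per original row.
hits = df.explode("attributes")["attributes"].isin(["attr1", "attr3"])
df.loc[hits.groupby(level=0).any()]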

Change a Pandas DataFrame with Integer Index

I have converted a Python dict to a pandas DataFrame:
d = {
    u'erterreherh': {
        u'account': u'rgrgrgrg',
        u'data': u'192.168.1.1',
    },
    u'hkghkghkghk': {
        u'account': u'uououopuopuop',
        u'data': '192.168.1.170',
    },
}
df = pd.DataFrame.from_dict(d, orient='index')
         data
account
aa       bbss
zz       sssss
vv       sss
"account" is the index here. I want the dataframe to look like the below; how can I do this?
  account  data
0 aa       bbss
1 zz       sssss
2 vv       sss
You need rename_axis to change the index name, and then reset_index:
d = {
    u'erterreherh': {
        u'account': u'rgrgrgrg',
        u'data': u'192.168.1.1'
    },
    u'hkghkghkghk': {
        u'account': u'uououopuopuop',
        u'data': '192.168.1.170'
    }
}
df = pd.DataFrame.from_dict(d, orient='index')
df = df.rename_axis('account1').reset_index()
print(df)

      account1           data        account
0  erterreherh    192.168.1.1       rgrgrgrg
1  hkghkghkghk  192.168.1.170  uououopuopuop
If you need to overwrite the column account with values from the index:
df = df.assign(account=df.index).reset_index(drop=True)
print(df)

            data      account
0    192.168.1.1  erterreherh
1  192.168.1.170  hkghkghkghk
df.reset_index() indeed works for me:
df
        data
account
aa      bbss
zz      sssss
vv      sss

df = df.reset_index()
  account  data
0 aa       bbss
1 zz       sssss
2 vv       sss