Nested dictionary to pandas df

My first question on Stack Overflow!
I have a triply nested dictionary and I want to convert it to a pandas DataFrame.
The dictionary has the following structure:
dictionary = {'CompanyA': {'Revenue': {date1: $1, date2: $2},
                           'ProfitLoss': {date1: $0, date2: $1}},
              'CompanyB': {'Revenue': {date1: $1, date2: $2},
                           'ProfitLoss': {date1: $0, date2: $1}},
              'CompanyC': {'Revenue': {date1: $1, date2: $2},
                           'ProfitLoss': {date1: $0, date2: $1}}}
So far I have been able to construct a df using:
df = pd.DataFrame.from_dict(dictionary)
But the result is a df with dictionaries as values, like this:
            CompanyA       CompanyB       CompanyC
Revenue     {date1:$1,..}  {date1:$1,..}  {date1:$1,..}
ProfitLoss  {date1:$0,..}  {date1:$0,..}  {date1:$0,..}
I want the table to look like this:
                  CompanyA  CompanyB  CompanyC
Revenue    Date1  $1        $1        $1
           Date2  $2        $2        $2
ProfitLoss Date1  $0        $0        $0
           Date2  $1        $1        $1
I have tried using pd.MultiIndex (.from_product and friends) and changing the index, with no result. Any idea what to do next? Any hint will be appreciated!

I see you're new, but there may already be an answer to a similar question; see this. Next time, try searching for a similar question using keywords. For example, I found the one I linked by searching "pandas nested dict", and the first result was that SO post!
Anyway, you need to reshape your input dict. You want a dict structured like this:
{
    'CompanyA': {
        ('Revenue', 'date1'): 1,
        ('ProfitLoss', 'date1'): 0,
        ...
    },
    ...
}
I would do something like this:
import pandas as pd

data = {
    'CompanyA': {
        'Revenue': {
            "date1": 1,
            "date2": 2
        },
        'ProfitLoss': {
            "date1": 0,
            "date2": 1
        }
    },
    'CompanyB': {
        'Revenue': {
            "date1": 4,
            "date2": 5
        },
        'ProfitLoss': {
            "date1": 2,
            "date2": 3
        }
    }
}

# Reshape your data and pass it to `DataFrame.from_dict`
df = pd.DataFrame.from_dict({i: {(j, k): data[i][j][k]
                                 for j in data[i] for k in data[i][j]}
                             for i in data}, orient="columns")
print(df)
print(df)
Output:
                  CompanyA  CompanyB
ProfitLoss date1         0         2
           date2         1         3
Revenue    date1         1         4
           date2         2         5
EDIT
Using actual datetimes to respond to your comment:
import pandas as pd
import datetime as dt

date1 = dt.datetime.now()
date2 = date1 + dt.timedelta(days=365)

data = {
    'CompanyA': {
        'Revenue': {
            date1: 1,
            date2: 2
        },
        'ProfitLoss': {
            date1: 0,
            date2: 1
        }
    },
    'CompanyB': {
        'Revenue': {
            date1: 4,
            date2: 5
        },
        'ProfitLoss': {
            date1: 2,
            date2: 3
        }
    }
}

# Reshape your data and pass it to `DataFrame.from_dict`
df = pd.DataFrame.from_dict({i: {(j, k): data[i][j][k]
                                 for j in data[i] for k in data[i][j]}
                             for i in data}, orient="columns")
print(df)
Output:
                                       CompanyA  CompanyB
ProfitLoss 2018-10-08 11:19:09.006375         0         2
           2019-10-08 11:19:09.006375         1         3
Revenue    2018-10-08 11:19:09.006375         1         4
           2019-10-08 11:19:09.006375         2         5
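For what it's worth, the same reshaping can also be done without the explicit dict comprehension by building one small frame per company and concatenating them. A minimal sketch, assuming the same data dict as above:
import pandas as pd

# `pd.DataFrame(v)` puts dates on the index and metrics on the columns;
# `.unstack()` turns that frame into a Series keyed by (metric, date),
# and `pd.concat` with a dict of those Series makes companies the columns.
df = pd.concat({company: pd.DataFrame(v).unstack()
                for company, v in data.items()}, axis=1)
print(df)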

Related

Pandas | Filter DF rows with Integers that lie between two Integer values in another Dataframe

I have two DataFrames. The goal is to filter out rows in DF1 that have an Integer value that lies between any of the Integers in the ["Begin"] and ["End"] columns in any of the 37 rows of DF2.
DF1:
INDEX  String    IntValues
1      "string"  808091
2      "string"  1168262
3      "string"  1169294
...    ...       ...
647    "string"  14193661
648    "string"  14551918
DF2:
Index  Begin       End
1      1196482.2   1216529
2      1791819.7   1834887
3      2008405.1   2014344
...    ...         ...
36     14168540.0  14193933
37     14727507.1  14779605
I think it is possible to use something like:
df1[(df1["IntValues"] >= 1196482.2) & (df1["IntValues"] <= 1216529), (... 36 more conditions)]
Is there a better way than just writing down these 37 conditions, like a variable for the begin and end values of that "filter window"?
Edit: As requested, a code sample. It is not from the original DF, but I hope it suffices.
import pandas as pd

d1 = {
    "string": ["String0", "String1", "String2", "String3", "String4", "String5", "String6", "String7",
               "String8", "String9", "String10", "String11", "String12", "String13", "String14"],
    "timestamp": [1168262, 1169294, 1184451, 1210449, 1210543, 1210607, 1644328,
                  1665732, 1694388, 1817309, 1822872, 1825310, 2093796, 2182923, 2209252],
    "should be in": ["Yes", "Yes", "Yes", "No", "No", "No", "yes", "yes", "yes", "no", "no", "no", "yes", "yes", "no"]
}
df1 = pd.DataFrame(d1)

d2 = {
    'begin': [1196482.2, 1791819.7, 2199564.6],
    'end': [1216529, 1834887, 2212352]
}
df2 = pd.DataFrame(d2)
Try this:
df_final = []
for i, j in zip(df2["Begin"], df2["End"]):
    x = df1[(df1["IntValues"] >= i) & (df1["IntValues"] <= j)]
    df_final.append(x)
df_final = pd.concat(df_final, axis=0).reset_index(drop=True)
df_final = df_final.drop_duplicates()
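If you'd rather avoid the Python loop entirely, a vectorized alternative is to compare every value against every window at once with NumPy broadcasting. A sketch, assuming the IntValues/Begin/End column names from the question:
import numpy as np

vals = df1["IntValues"].to_numpy()
# mask[i] is True if vals[i] falls inside at least one [Begin, End] window
mask = ((vals[:, None] >= df2["Begin"].to_numpy()) &
        (vals[:, None] <= df2["End"].to_numpy())).any(axis=1)
df_final = df1[mask].reset_index(drop=True)
# use df1[~mask] instead if you want to drop the in-window rows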

In apache spark SQL, how to remove the duplicate rows when using collect_list in window function?

I have the below dataframe:
+----+-----+----+--------+
|year|month|item|quantity|
+----+-----+----+--------+
|2019|1 |TV |8 |
|2019|2 |AC |10 |
|2018|1 |TV |2 |
|2018|2 |AC |3 |
+----+-----+----+--------+
Using a window function, I want to get the output below:
val partitionWindow = Window.partitionBy("year").orderBy("month")
val itemsList = collect_list(struct("item", "quantity")).over(partitionWindow)
df.select($"year", itemsList as "items")
Expected output:
+----+-------------------+
|year|items |
+----+-------------------+
|2019|[[TV, 8], [AC, 10]]|
|2018|[[TV, 2], [AC, 3]] |
+----+-------------------+
But when I use the window function, there is a cumulative duplicate row per item:
Current output:
+----+-------------------+
|year|items |
+----+-------------------+
|2019|[[TV, 8]] |
|2019|[[TV, 8], [AC, 10]]|
|2018|[[TV, 2]] |
|2018|[[TV, 2], [AC, 3]] |
+----+-------------------+
What is the best way to remove the duplicate rows?
I believe the interesting part here is that the aggregated list of items has to be sorted by month, so I've written code for three approaches.
Creating a sample dataset:
import org.apache.spark.sql._
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

case class data(year: Int, month: Int, item: String, quantity: Int)

val spark = SparkSession.builder().master("local").getOrCreate()
import spark.implicits._

val inputDF = spark.createDataset(Seq(
  data(2018, 2, "AC", 3),
  data(2019, 2, "AC", 10),
  data(2019, 1, "TV", 2),
  data(2018, 1, "TV", 2)
)).toDF()
Approach 1: Aggregating month, item and quantity into a list and then sorting the items by month using a UDF:
case class items(item: String, quantity: Int)

def getItemsSortedByMonth(itemsRows: Seq[Row]): Seq[items] = {
  if (itemsRows == null || itemsRows.isEmpty) {
    null
  } else {
    itemsRows.sortBy(r => r.getAs[Int]("month"))
      .map(r => items(r.getAs[String]("item"), r.getAs[Int]("quantity")))
  }
}

val itemsSortedByMonthUDF = udf(getItemsSortedByMonth(_: Seq[Row]))

val outputDF = inputDF.groupBy(col("year"))
  .agg(collect_list(struct("month", "item", "quantity")).as("items"))
  .withColumn("items", itemsSortedByMonthUDF(col("items")))
Approach 2: Using window functions:
val monthWindowSpec = Window.partitionBy("year").orderBy("month")
val rowNumberWindowSpec = Window.partitionBy("year").orderBy("row_number")
val runningList = collect_list(struct("item", "quantity")).over(rowNumberWindowSpec)

val tempDF = inputDF
  // using row_number for continuous ranks if there are multiple items in the same month
  .withColumn("row_number", row_number().over(monthWindowSpec))
  .withColumn("items", runningList)
  .drop("month", "item", "quantity")
tempDF.persist()

val yearToSelect = tempDF.groupBy("year").agg(max("row_number").as("row_number"))
val outputDF = tempDF.join(yearToSelect, Seq("year", "row_number")).drop("row_number")
Edit:
Added a third approach for posterity, using the Dataset APIs groupByKey and mapGroups:
// encoding to the data case class can be avoided if inputDF is not converted to a dataset of Row objects
val outputDF = inputDF.as[data].groupByKey(_.year).mapGroups { case (year, rows) =>
  val itemsSortedByMonth = rows.toSeq.sortBy(_.month).map(s => items(s.item, s.quantity))
  (year, itemsSortedByMonth)
}.toDF("year", "items")
Initially I was looking for an approach without a UDF. That was OK except for one aspect that I could not solve elegantly. With a simple map UDF it is extremely simple, simpler than the other answers. So, for posterity, and a little late due to other commitments:
Try this...
import spark.implicits._
import org.apache.spark.sql.Row
import org.apache.spark.sql.functions._

case class abc(year: Int, month: Int, item: String, quantity: Int)

val itemsList = collect_list(struct("month", "item", "quantity"))

val my_udf = udf { items: Seq[Row] =>
  val res = items.map { r => (r.getAs[String](1), r.getAs[Int](2)) }
  res
}

// Gen some data; however, not the thrust of the problem.
val df0 = Seq(abc(2019, 1, "TV", 8), abc(2019, 7, "AC", 10), abc(2018, 1, "TV", 2), abc(2018, 2, "AC", 3), abc(2019, 2, "CO", 7)).toDS()
val df1 = df0.toDF()

val df2 = df1.groupBy($"year")
  .agg(itemsList as "items")
  .withColumn("sortedCol", sort_array($"items", asc = true))
  .withColumn("sortedItems", my_udf(col("sortedCol")))
  .drop("items").drop("sortedCol")
  .orderBy($"year".desc)

df2.show(false)
df2.printSchema()
Noting the following, which you should fix:
doing the order by later is better, imho
mistakes in the data (fixed now)
ordering the month as a String is not a good idea; you need to convert it to a month number
Returns:
+----+----------------------------+
|year|sortedItems |
+----+----------------------------+
|2019|[[TV, 8], [CO, 7], [AC, 10]]|
|2018|[[TV, 2], [AC, 3]] |
+----+----------------------------+

datatables order by number doc and year of doc

I don't know how to implement a simple way to sort a column that shows the document number with a trailing year, like:
COL 1 COL 2 COL 3 DOC. NR
------------------------------------
x x x 2/2020
------------------------------------
x x x 3/2020
------------------------------------
x x x 4/2021
------------------------------------
x x x 1/2022
------------------------------------
In my example I'd like to sort the doc. nr column asc or desc, based on the document number (grouped by year).
I tried to set data-order with the formula numberdoc + year, but that doesn't work because, for example, 3 + 2020 equals 1 + 2022... so this way is not correct. Any idea?
<td data-order="2020">1/2019</td>
Script:
$(document).ready(function () {
  var table = $('#example').DataTable({
    "aaSorting": [[1, "desc"]],
    "columnDefs": [
      {
        targets: [1],
        data: {
          _: "1.display",
          sort: "1.#data-order",
          type: "1.#data-order"
        }
      }
    ]
  });
});
My fiddle: http://live.datatables.net/jivekefo/1/edit
Expected (ASC) result:
1/2018
2/2018
3/2018
1/2019
2/2019
3/2019
4/2019
1/2020
...
OK, solved with this rule:
<td data-order="<?php echo (new Datetime($dateDoc))->format("Y") . str_pad(ltrim($numberDoc, "0"), 5, "0", STR_PAD_LEFT); /* example. 201900001 */ ?>">
http://live.datatables.net/jivekefo/2/edit

How to select pandas row(s) whose attributes column contains any one of the values from a list

import pandas as pd

data = {
    "name": ["abc", "xyz", "pqr"],
    "attributes": [["attr2", "attr3"], ["attr2", "attr4"], ["attr3", "attr1"]]
}
df = pd.DataFrame.from_dict(data)
How do I filter rows which satisfy this condition:
select a row if its attributes column contains any of "attr1" or "attr3"
The expected output is:
  name   attributes
0 "abc"  ["attr2", "attr3"]
1 "pqr"  ["attr3", "attr1"]
Using:
df[pd.DataFrame(df.attributes.tolist()).isin(['attr1', 'attr3']).any(axis=1)]
Out[295]:
       attributes  name
0  [attr2, attr3]   abc
2  [attr3, attr1]   pqr
To get a boolean indexer,
>>> idx = df['attributes'].map(lambda l: any(s in l for s in ['attr1', 'attr3']))
>>> idx
0 True
1 False
2 True
Name: attributes, dtype: bool
Then
>>> df.loc[idx]
name attributes
0 abc [attr2, attr3]
2 pqr [attr3, attr1]
Whether you want to reset the index afterward is up to you.
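If you prefer to stay with built-in pandas operations end to end, an explode-based variant (a sketch, assuming pandas 0.25+ for DataFrame.explode) does the same membership test:
# `explode` repeats each row once per list element and keeps the original
# index, so groupby(level=0) reduces the per-element test back to one
# boolean per original row.
hits = df.explode("attributes")["attributes"].isin(["attr1", "attr3"])
df.loc[hits.groupby(level=0).any()]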

Change a Pandas DataFrame with Integer Index

I have converted a Python dict to a pandas DataFrame:
d = {
    u'erterreherh': {
        u'account': u'rgrgrgrg',
        u'data': u'192.168.1.1',
    },
    u'hkghkghkghk': {
        u'account': u'uououopuopuop',
        u'data': '192.168.1.170',
    },
}
df = pd.DataFrame.from_dict(d, orient='index')
         data
account
aa       bbss
zz       sssss
vv       sss
"account" is the index here. I want the dataframe to look like the below; how can I do this?
  account  data
0 aa       bbss
1 zz       sssss
2 vv       sss
You need rename_axis to change the index name, and then reset_index:
d = {
    u'erterreherh': {
        u'account': u'rgrgrgrg',
        u'data': u'192.168.1.1'
    },
    u'hkghkghkghk': {
        u'account': u'uououopuopuop',
        u'data': '192.168.1.170'
    }
}
df = pd.DataFrame.from_dict(d, orient='index')
df = df.rename_axis('account1').reset_index()
print(df)

      account1           data        account
0  erterreherh    192.168.1.1       rgrgrgrg
1  hkghkghkghk  192.168.1.170  uououopuopuop
If you need to overwrite the column account with values from the index:
df = df.assign(account=df.index).reset_index(drop=True)
print(df)

            data      account
0    192.168.1.1  erterreherh
1  192.168.1.170  hkghkghkghk
df.reset_index() indeed works for me:
df
        data
account
aa      bbss
zz      sssss
vv      sss

df = df.reset_index()
  account  data
0 aa       bbss
1 zz       sssss
2 vv       sss