Transform DataFrame to list of dictionaries where column name is a value of key:value pair - pandas

I have a pandas DataFrame as follows:
|-----|----|-----|
| A   | B  | C   |
|-----|----|-----|
| abc | 34 | 8   |
| abc |    | 12  |
| abc | 6  | 321 |
|-----|----|-----|
I would like to convert it to a list of dictionaries like this:
[
    {
        name: "A",
        value: "abc"
    },
    {
        name: "B",
        value: 34
    },
    {
        name: "C",
        value: 8
    }
]
There are several ways to do it with a lot of data manipulation, but I am looking for one that is straightforward, if it exists.
Thank you for your help

[[{'name':k, 'value':v} for k,v in x.items()] for x in df.to_dict(orient='records')]
This would probably work, not sure it is straightforward though.
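For reference, here is what that comprehension produces on the sample frame (a quick sketch; note that it yields one list of name/value dicts per row, so the overall result is a list of lists):

import pandas as pd

df = pd.DataFrame({"A": ["abc", "abc", "abc"],
                   "B": [34, None, 6],
                   "C": [8, 12, 321]})

# one {'name': ..., 'value': ...} dict per column, one list per row
records = [[{"name": k, "value": v} for k, v in row.items()]
           for row in df.to_dict(orient="records")]

print(records[0])
# [{'name': 'A', 'value': 'abc'}, {'name': 'B', 'value': 34.0}, {'name': 'C', 'value': 8}]

(B comes back as 34.0 because the missing value in that column forces it to float.)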

Iterate through pandas data frame and replace some strings with numbers

I have a dataframe sample_df that looks like:
bar foo
0 rejected unidentified
1 clear caution
2 caution NaN
Note this is just a random made-up df; there are lots of other columns, say with data types other than just text. bar and foo might also have lots of empty cells/values, which are NaNs.
The actual df looks like this (the above is just a sample):
| | Unnamed: 0 | user_id | result | face_comparison_result | created_at | facial_image_integrity_result | visual_authenticity_result | properties | attempt_id |
|-----:|-------------:|:---------------------------------|:---------|:-------------------------|:--------------------|:--------------------------------|:-----------------------------|:----------------|:---------------------------------|
| 0 | 58 | ecee468d4a124a8eafeec61271cd0da1 | clear | clear | 2017-06-20 17:50:43 | clear | clear | {} | 9e4277fc1ddf4a059da3dd2db35f6c76 |
| 1 | 76 | 1895d2b1782740bb8503b9bf3edf1ead | clear | clear | 2017-06-20 13:28:00 | clear | clear | {} | ab259d3cb33b4711b0a5174e4de1d72c |
| 2 | 217 | e71b27ea145249878b10f5b3f1fb4317 | clear | clear | 2017-06-18 21:18:31 | clear | clear | {} | 2b7f1c6f3fc5416286d9f1c97b15e8f9 |
| 3 | 221 | f512dc74bd1b4c109d9bd2981518a9f8 | clear | clear | 2017-06-18 22:17:29 | clear | clear | {} | ab5989375b514968b2ff2b21095ed1ef |
| 4 | 251 | 0685c7945d1349b7a954e1a0869bae4b | clear | clear | 2017-06-18 19:54:21 | caution | clear | {} | dd1b0b2dbe234f4cb747cc054de2fdd3 |
| 5 | 253 | 1a1a994f540147ab913fcd61b7a859d9 | clear | clear | 2017-06-18 20:05:05 | clear | clear | {} | 1475037353a848318a32324539a6947e |
| 6 | 334 | 26e89e4a60f1451285e70ca8dc5bc90e | clear | clear | 2017-06-17 20:21:54 | suspected | clear | {} | 244fa3e7cfdb48afb44844f064134fec |
| 7 | 340 | 41afdea02a9c42098a15d94a05e8452b | NaN | clear | 2017-06-17 20:42:53 | clear | clear | {} | b066a4043122437bafae3ddcf6c2ab07 |
| 8 | 424 | 6cf6eb05a3cc4aabb69c19956a055eb9 | rejected | NaN | 2017-06-16 20:00:26 |
I want to replace any strings I find with numbers, per the below mapping.
def no_strings(df):
    columns = list(df)
    for column in columns:
        df[column] = df[column].map(result_map)

# We will need a mapping of strings to numbers to be able to analyse later.
result_map = {'unidentified': 0, "clear": 1, 'suspected': 2, "caution": 3, 'rejected': 4}
So the output might look like:
bar foo
0 4 0
1 1 3
2 3 NaN
For some reason, when I run no_strings(sample_df) I get errors.
What am I doing wrong?
df['bar'] = df['bar'].map(result_map)
df['foo'] = df['foo'].map(result_map)
df
  bar  foo
0   4    0
1   1    3
2   3  NaN
However, if you wish to be on the safe side (in case a value is not in your result_map and you don't want to see a NaN), do this:
df['foo'] = df['foo'].map(lambda x: result_map.get(x, 'not found'))
df['bar'] = df['bar'].map(lambda x: result_map.get(x, 'not found'))
So the output for this df
bar foo
0 rejected unidentified
1 clear caution
2 caution suspected
3 sdgdg 0000
will result in:
bar foo
0 4 0
1 1 3
2 3 2
3 not found not found
To be extra efficient:
cols = ['foo', 'bar', 'other_columns']
for c in cols:
    df[c] = df[c].map(lambda x: result_map.get(x, 'not found'))
Let's try stack, map the dict and then unstack:
df.stack().to_frame()[0].map(result_map).unstack()
  bar  foo
0   4    0
1   1    3
2   3  NaN
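If you still want a reusable no_strings-style helper that works across many columns without breaking on the numeric ones, a minimal sketch could look like this (assuming result_map is defined before the call; the dtype filter is just one way to skip non-text columns, and unmapped values are left untouched rather than replaced with 'not found'):

import numpy as np
import pandas as pd

result_map = {'unidentified': 0, 'clear': 1, 'suspected': 2, 'caution': 3, 'rejected': 4}

def no_strings(df):
    # only touch the object (string) columns; numbers, dates, etc. are left alone
    for column in df.select_dtypes(include='object').columns:
        df[column] = df[column].map(lambda x: result_map.get(x, x))
    return df

sample_df = pd.DataFrame({'bar': ['rejected', 'clear', 'caution'],
                          'foo': ['unidentified', 'caution', np.nan]})
print(no_strings(sample_df))
#   bar  foo
# 0   4    0
# 1   1    3
# 2   3  NaN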

Distinct Sum and Group by

I have a dataset [example below] and I want to create 2 tables out of it:
+------+------------+-------+-------+-------+--------+
| corp | product    | data  | Group | sales | market |
+------+------------+-------+-------+-------+--------+
| A    | Eli        | 43831 | A     | 100   | I      |
| A    | Eli        | 43831 | B     | 100   | I      |
| B    | Sut        | 43831 | A     | 80    | I      |
| A    | Api        | 43831 | C     | 50    | C or D |
| A    | Api        | 43831 | D     | 50    | C or D |
| B    | Konkurent2 | 43831 | C     | 40    | C or D |
+------+------------+-------+-------+-------+--------+
1st: sum(sales) by market, excluding duplicated rows. I want to end up with sales for each market in a specific date range (data column) but without the duplicates; I have them because 1 product can be in more than 1 group.
So the first table, for example for MRCC I, would look like:
+--------+-------+-------+
| market | sales | data  |
+--------+-------+-------+
| I      | 180   | 43831 |
+--------+-------+-------+
Then I would like the second table to look like the one above, but with an additional 'dictionary' column holding the unique product names within market and date, so for MRCC I it would look like:
+--------+-------+-------+----------------+
| market | sales | data  | unique product |
+--------+-------+-------+----------------+
| I      | 180   | 43831 | eli            |
| I      | 180   | 43831 | Sut            |
+--------+-------+-------+----------------+
The thing is, I'm not that experienced in SQL and I'm fairly new to data processing. The system I am working in allows me to do some of the data processing either through "visual" recipes or through SQL code, which I'm not that familiar with. Even more confusing, I can choose between 3 SQL engines: Impala, Hive and Spark SQL. For example, to create the market column I used Impala, and the script looks like this (I'm not sure if this is "pure" Impala syntax):
SELECT * FROM
(
    -- mrc I --
    SELECT *,
        CASE WHEN (`product` = "Eli") OR (`product` = "Sut")
             THEN "MRCC I"
        END AS market
    FROM x.`y`
) a
WHERE market IS NOT NULL
Could you give me some tips on the structure of the code, and whether this is even possible?
Thanks,
eM
import spark.implicits._
import org.apache.spark.sql.functions._

case class Sale(
    corp: String,
    product: String,
    data: Long,
    group: String,
    sales: Long,
    market: String
)

val df = Seq(
    Sale("A", "Eli", 43831, "A", 100, "I"),
    Sale("A", "Eli", 43831, "B", 100, "I"),
    Sale("B", "Sut", 43831, "A", 80, "I"),
    Sale("A", "Api", 43831, "C", 50, "C or D"),
    Sale("A", "Api", 43831, "D", 50, "C or D"),
    Sale("B", "Konkurent2", 43831, "C", 40, "C or D")
).toDF()

val t2 = df.dropDuplicates(Seq("corp", "product", "data", "market"))
    .groupBy("market", "product", "data").sum("sales")
    .select(
        'market,
        col("sum(sales)").alias("sales"),
        'data,
        'product.alias("unique product")
    )
t2.show(false)
// +------+-----+-----+--------------+
// |market|sales|data |unique product|
// +------+-----+-----+--------------+
// |I |80 |43831|Sut |
// |I |100 |43831|Eli |
// |C or D|40 |43831|Konkurent2 |
// |C or D|50 |43831|Api |
// +------+-----+-----+--------------+
val t1 = t2.drop("unique product")
    .groupBy("market", "data").sum("sales")
    .select(
        'market,
        col("sum(sales)").alias("sales"),
        'data)
t1.show(false)
// +------+-----+-----+
// |market|sales|data |
// +------+-----+-----+
// |I |180 |43831|
// |C or D|90 |43831|
// +------+-----+-----+
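If you would rather stay in Python (for example in a PySpark recipe), a rough equivalent of the same dedup-then-aggregate idea, on the same made-up sample data, could be sketched like this:

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

rows = [("A", "Eli", 43831, "A", 100, "I"),
        ("A", "Eli", 43831, "B", 100, "I"),
        ("B", "Sut", 43831, "A", 80, "I"),
        ("A", "Api", 43831, "C", 50, "C or D"),
        ("A", "Api", 43831, "D", 50, "C or D"),
        ("B", "Konkurent2", 43831, "C", 40, "C or D")]
df = spark.createDataFrame(rows, ["corp", "product", "data", "group", "sales", "market"])

# table 2: one row per product within market/date, duplicates caused by groups removed first
t2 = (df.dropDuplicates(["corp", "product", "data", "market"])
        .groupBy("market", "product", "data")
        .agg(F.sum("sales").alias("sales"))
        .select("market", "sales", "data", F.col("product").alias("unique product")))

# table 1: total sales per market/date
t1 = t2.groupBy("market", "data").agg(F.sum("sales").alias("sales"))

t2.show(truncate=False)
t1.show(truncate=False)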

Postgresql join on array and transform to json

I would like to make a join on an array containing ids and transform the result of this subselect into JSON (a JSON array).
I have the following model:

The lnam_refs column contains identifiers that relate to the lnam column.
I would like to transform the lnam_refs column into something like [row_to_json(), row_to_json()] or [] or [row_to_json()] or …
I tried several methods but I cannot achieve a clean result…
To try to be clearer :
Table in input:
   id   |        label         |   lnam   |       lnam_refs
--------+----------------------+----------+-----------------------
      1 | 'master1'            | 11111111 | {33333333}
      2 | 'master2'            | 22222222 | {44444444,55555555}
      3 | 'slave1'             | 33333333 | {}
      4 | 'slave2'             | 44444444 | {}
      5 | 'slave3'             | 55555555 | {}
      6 | 'master3'            | 66666666 | {}
Results Expected:
id | label | lnam | lnam_refs | slaves
--------+----------------------+----------+-----------------------+---------------------------------
1 | 'master1' | 11111111 | {33333333} | [ {id: 3, label: 'slave1', lnam: 33333333, lnam_refs: []} ]
2 | 'master2' | 22222222 | {44444444,55555555} | [ {id: 4, label: 'slave2', lnam: 44444444, lnam_refs: []}, {id: 5, label: 'slave3', lnam: 55555555, lnam_refs: []} ]
6 | 'master3' | 66666666 | {} | []
Thanks for your help !
Here's one way to do it. (I created a table called t with that data you supplied.)
SELECT *, (SELECT JSON_AGG(ROW_TO_JSON(t2)) FROM t t2 WHERE label LIKE 'slave%' AND lnam = ANY(t1.lnam_refs)) AS slaves
FROM t t1
WHERE label LIKE 'master%'
I use the label field in the WHERE clause as I don't know how else you're determining which records should be master etc.
Result:
1;master1;11111111;{33333333};[{"id":3,"label":"slave1","lnam":33333333,"lnam_refs":[]}]
2;master2;22222222;{44444444,55555555};[{"id":4,"label":"slave2","lnam":44444444,"lnam_refs":[]}, {"id":5,"label":"slave3","lnam":55555555,"lnam_refs":[]}]
6;master3;66666666;{};
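Note that JSON_AGG returns NULL rather than an empty array when the subquery matches no rows, which is why the master3 line above ends with an empty slaves value; if you need the literal [] from the expected output, you can wrap the subselect in COALESCE(..., '[]'::json).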

How to create hierarchal json object using ltree query results? (postgresql)

I'm trying to create a storage system for custom categories using postgres.
After looking around for potential solutions, I settled on trying to use ltree.
Here is an example of the raw data below:
+----+---------+---------------------------------+-----------+
| id | user_id | path                            | name      |
+----+---------+---------------------------------+-----------+
| 1  | 1       | root.test                       | test      |
| 2  | 1       | root.test.inbox                 | inbox     |
| 3  | 1       | root.personal                   | personal  |
| 4  | 1       | root.project                    | project   |
| 5  | 1       | root.project.idea               | idea      |
| 6  | 1       | root.personal.events            | events    |
| 7  | 1       | root.personal.events.january    | january   |
| 8  | 1       | root.project.objective          | objective |
| 9  | 1       | root.personal.events.february   | february  |
| 10 | 1       | root.project.objective.january  | january   |
| 11 | 1       | root.project.objective.february | february  |
+----+---------+---------------------------------+-----------+
I thought that it might be easier to first order the results and remove the top level from the returned path, using:
select id, name, subpath(path, 1) as path, nlevel(subpath(path, 1)) as level from testLtree order by level, path
I get:
+----+-----------+----------------------------+-------+
| id | name      | path                       | level |
+----+-----------+----------------------------+-------+
| 3  | personal  | personal                   | 1     |
| 4  | project   | project                    | 1     |
| 1  | test      | test                       | 1     |
| 6  | events    | personal.events            | 2     |
| 5  | idea      | project.idea               | 2     |
| 8  | objective | project.objective          | 2     |
| 2  | inbox     | test.inbox                 | 2     |
| 9  | february  | personal.events.february   | 3     |
| 7  | january   | personal.events.january    | 3     |
| 11 | february  | project.objective.february | 3     |
| 10 | january   | project.objective.january  | 3     |
+----+-----------+----------------------------+-------+
I'm hoping to be able to transform this result into a set of JSON data somehow. I would like an output similar to this:
personal: {
    id: 3,
    name: 'personal',
    children: {
        events: {
            id: 6,
            name: 'events',
            children: {
                january: {
                    id: 7,
                    name: 'january',
                    children: null
                },
                february: {
                    id: 9,
                    name: 'february',
                    children: null
                }
            }
        }
    }
},
project: {
    id: 4,
    name: 'project',
    children: {
        idea: {
            id: 5,
            name: 'idea',
            children: null
        },
        objective: {
            id: 8,
            name: 'objective',
            children: {
                january: {
                    id: 10,
                    name: 'january',
                    children: null
                },
                february: {
                    id: 11,
                    name: 'february',
                    children: null
                }
            }
        }
    }
},
test: {
    id: 1,
    name: 'test',
    children: {
        inbox: {
            id: 2,
            name: 'inbox',
            children: null
        }
    }
}
I've been looking around for the best way to do this but haven't come across any solutions that make sense to me. However, as I am new to Postgres and SQL in general, this is expected.
I think I may have to use a recursive query? I'm a bit confused about what the best method/execution of this would be. Any help/advice is much appreciated, and if you have any further questions please ask.
I've put everything into a sqlfiddle below;
http://sqlfiddle.com/#!17/1713e/5
I ran into the same problem as you. I struggled with this a lot in PostgreSQL and it became overly complex to solve. Since I'm using Django (a Python framework), I decided to solve it using Python. In case it can help anyone in the same situation, I would like to share the code:
https://gist.github.com/eherrerosj/4685e3dc843e94f3ef8645d31dbe490c
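If you would rather build the nesting in application code without going through the gist, here is a minimal Python sketch of the idea (the rows are hard-coded to mirror the ordered query result above; with a real database you would fetch id, name and subpath(path, 1) through whatever driver you use). Because the rows are ordered by level, every parent is inserted before its children:

import json

# (id, name, subpath(path, 1)) in the order returned by the query above
rows = [
    (3, 'personal', 'personal'),
    (4, 'project', 'project'),
    (1, 'test', 'test'),
    (6, 'events', 'personal.events'),
    (5, 'idea', 'project.idea'),
    (8, 'objective', 'project.objective'),
    (2, 'inbox', 'test.inbox'),
    (9, 'february', 'personal.events.february'),
    (7, 'january', 'personal.events.january'),
    (11, 'february', 'project.objective.february'),
    (10, 'january', 'project.objective.january'),
]

tree = {}
for id_, name, path in rows:
    *parents, leaf = path.split('.')
    current = tree
    for label in parents:            # walk down to the parent's children dict
        parent = current[label]
        if parent['children'] is None:
            parent['children'] = {}
        current = parent['children']
    current[leaf] = {'id': id_, 'name': name, 'children': None}

print(json.dumps(tree, indent=2))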

Add Column in a Spark DataFrame, based on a parametric SQL query dependent on values of some fields of the DataFrame

I have several Spark DataFrames (we can call them table a, table b, etc.).
I want to add a column just to table a, based on the result of a query to one of the other tables; but which table gets queried changes every time, based on the value of one of the fields of table a. So this query should be parametric.
Below I show an example to make the problem clear:
Every table has the column OID and a column TableName with the name of the current table, plus other columns.
This is the fixed query to be performed on Tab A to add the new column:
Select $ColumnName from $TableName where OID=$oids
Tab A
| oids|TableName |ColumnName | other fields|New Column: ValueOidDb
================================================================
| 2 | Book | Title | x |result query:harry potter
| 8 | Book | Isbn | y |result query: 556
| 1 | Author | Name | z |result query:Tolkien
| 4 | Category |Description| b |result query: Commedy
Tab Book
| OID |TableName |Title |Isbn |other fields|
================================================================
| 2 | Book |harry potter| 123 | x |
| 8 | Book | hobbit | 556 | y |
| 21 | Book | etc | 8942 | z |
| 5 | Book | etc2 | 984 | b |
Tab Author
| OID |TableName |Name |nationality |other fields|
================================================================
| 5 | Author |J.Rowling | eng | x |
| 2 | Author |Geor. Martin| us | y |
| 1 | Author | Tolkien | eng | z |
| 13 | Author | Dan Brown | us | b |
Tab Category
| OID | TableName | Description |
=====================================
| 12 | Category | Fantasy |
| 4 | Category | Commedy |
| 9 | Category | Thriller |
| 7 | Category | Action |
I tried with this UDF:
def setValueOid = (oid: Int, TableName: String, TableColumn: String) => {
    try {
        sqlContext.sql(s"Select $TableColumn from $TableName where OID = $oid ").first().toString()
    }
    catch {
        case x: java.lang.NullPointerException => "error"
    }
}

sqlContext.udf.register("setValueOid", setValueOid)

val FinalRtxf = sqlContext.sql("SELECT all the column of TAB A ,"
    + " setValueOid(oid, Table, AttributeDatabaseColumn) as ValueOidDb"
    + " FROM TAB A")
I put the code in a try/catch because otherwise it gives me a NullPointerException, but it doesn't work, because it always returns the "error" value from the catch block.
If I call the function directly, without going through a SQL query, by just passing some manual parameters, it works perfectly:
val try=setValueOid(8,"BOOK","ISBN")
try: String = [0977326403 ]
I read here that it is not possible to make a query inside a UDF:
Trying to execute a spark sql query from a UDF
So how can I solve my problem? I don't know how to make a parametric join. I tried this:
%sql
Select all attributes TAB A,
FROM TAB A as a
join (Select $AttributeDatabaseColumn ,TableName from $Table where OID=$oid) as b
on a.Table=b.TableName
but it gave me this exception:
org.apache.spark.sql.AnalysisException: cannot recognize input near '$' 'AttributeDatabaseColumn' ',' in select clause; line 3 pos 1 at org.apache.spark.sql.hive.HiveQl$.createPlan(HiveQl.scala:318)
at org.apache.spark.sql.hive.ExtendedHiveQlParser$$anonfun$hiveQl$1.apply(ExtendedHiveQlParser.scala:41)
at org.apache.spark.sql.hive.ExtendedHiveQlParser$$anonfun$hiveQl$1.apply(ExtendedHiveQlParser.scala:40)
One option:
transform each Book, Author, Category to a form:
root
|-- oid: integer (nullable = false)
|-- tableName: string (nullable = true)
|-- properties: map (nullable = true)
| |-- key: string
| |-- value: string (valueContainsNull = true)
For example, the first record in Book:
val book = Seq((2L, "Book",
    Map("title" -> "harry potter", "Isbn" -> "123", "other field" -> "x")
)).toDF("oid", "tableName", "properties")
+---+---------+---------------------------------------------------------+
|oid|tableName|properties |
+---+---------+---------------------------------------------------------+
|2 |Book |Map(title -> harry potter, Isbn -> 123, other field -> x)|
+---+---------+---------------------------------------------------------+
union Book, Author, Category as properties.
val properties = book.union(author).union(category)
join with base table:
val comb = properties.join(table, Seq("oid", "tableName"))
Use case when ... based on tableName to add the new column from the properties field.
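To make that last part concrete, here is a small PySpark sketch of the same three steps (the table and column names are taken from the example above, only the Book table is converted to a properties map for brevity, and the tiny sample frames are made up, so treat this as an outline rather than a finished implementation; also note that once the properties are in a map, the value can be pulled out directly with the ColumnName column, which takes the place of the case when):

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# hypothetical mini versions of Tab A and Tab Book, just to show the shape
tab_a = spark.createDataFrame(
    [(2, "Book", "Title"), (8, "Book", "Isbn")],
    ["oids", "TableName", "ColumnName"])
book = spark.createDataFrame(
    [(2, "Book", "harry potter", "123"), (8, "Book", "hobbit", "556")],
    ["OID", "TableName", "Title", "Isbn"])

# step 1: collapse each lookup table's columns into a single map column
book_props = book.select(
    F.col("OID").alias("oids"),
    "TableName",
    F.create_map(F.lit("Title"), F.col("Title"),
                 F.lit("Isbn"), F.col("Isbn")).alias("properties"))

# step 2: union the per-table property frames (Author and Category would be added here)
properties = book_props  # .union(author_props).union(category_props)

# step 3: join back to Tab A and pick the wanted entry out of the map by column name
result = (tab_a.join(properties, ["oids", "TableName"], "left")
               .withColumn("ValueOidDb", F.expr("properties[ColumnName]")))
result.show(truncate=False)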