My pandas DataFrame is below. I want to convert it to aggregated key-value pairs. Below is what I have achieved, and also where I am falling short.
import pandas as pd
import io
data = """
Name factory1 factory2 factory3
Philips China US
Facebook US
Taobao China Taiwan Australia
"""
df = pd.read_table(io.StringIO(data), delim_whitespace=True)
df.set_index('Name').to_dict('index')
{'Philips': {'factory1': 'China', 'factory2': 'US', 'factory3': nan},
'Facebook': {'factory1': 'US', 'factory2': nan, 'factory3': nan},
'Taobao': {'factory1': 'China', 'factory2': 'Taiwan', 'factory3': 'Australia'}}
My expected output is:
{'Philips': {'China', 'US'},
'Facebook': {'US'},
'Taobao': {'China', 'Taiwan', 'Australia'}}
Is there some way to aggregate?
Let us try stack with groupby and to_dict:
out = df.set_index('Name').stack().groupby(level=0).agg(set).to_dict()
Out[109]:
{'Facebook': {'US'},
'Philips': {'China', 'US'},
'Taobao': {'Australia', 'China', 'Taiwan'}}
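For comparison, the same result can be reached without stack, by iterating rows and collecting the non-NaN values into a set per Name. A minimal pure-pandas sketch, reusing the sample data above:

```python
import io
import pandas as pd

data = """
Name factory1 factory2 factory3
Philips China US
Facebook US
Taobao China Taiwan Australia
"""
df = pd.read_csv(io.StringIO(data), sep=r"\s+")

# Drop the NaNs in each row and collect the remaining factory values into a set
out = {name: set(row.dropna()) for name, row in df.set_index('Name').iterrows()}
print(out)
```

The stack/groupby version is usually preferable on large frames, since iterrows is slow; this form is just easier to read.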
I have a DataFrame as below. If a FID contains more than one polygon, I need to create a MultiPolygon for that FID. In my DataFrame, FID 978 contains two polygons, so it should be converted to a MultiPolygon; otherwise the geometry stays a Polygon.
|FID|Geometry|
|975|POLYGON ((-94.2019149289999 32.171910245, -94.201889847 32.171799917, -94.2019145369999 32.1719110220001, -94.2019344619999 32.171974117, -94.2019149289999 32.171910245))
|976|POLYGON ((-94.204485668 32.175813341, -94.2045721649999 32.1758854190001, -94.2044856639999 32.1758124690001, -94.204358881 32.1757171630001, -94.204485668 32.175813341))
|978|POLYGON ((-94.30755277 32.402906479, -94.321881945 32.4028226820001, -94.321896361 32.4035500580001, -94.3074214489999 32.4037557020001, -94.3075504129999 32.4029064600001, -94.30755277 32.402906479))
|978|POLYGON ((-94.30755277 32.402906479, -94.307552779 32.4005399370001, -94.307558688 32.4005401040001, -94.30755277 32.402906479))
I am using the following function to convert to MultiPolygons:
def ploygon_to_multipolygon(wkt):
    list_polygons = [wkt.loads(poly) for poly in wkt]
    return shapely.geometry.MultiPolygon(list_polygons)
It looks like the polygons are not being converted to MultiPolygons.
from shapely.wkt import loads
from shapely.ops import unary_union
# Convert from wkt ...
# I think there's a better way to do this with Geopandas, this is pure pandas.
df.Geometry = df.Geometry.apply(loads)
# Use groupby and unary_union to combine Polygons.
df = df.groupby('FID', as_index=False)['Geometry'].apply(unary_union)
print(df)
# Let's print out the multi-polygon to verify
print(df.iat[2,1])
Output:
FID Geometry
0 975 POLYGON ((-94.2019149289999 32.171910245, -94....
1 976 POLYGON ((-94.204485668 32.175813341, -94.2045...
2 978 (POLYGON ((-94.30755277900001 32.4005399370001...
MULTIPOLYGON (((-94.30755277900001 32.4005399370001, -94.307558688 32.4005401040001, -94.30755277 32.402906479, -94.30755277900001 32.4005399370001)), ((-94.321881945 32.4028226820001, -94.321896361 32.4035500580001, -94.3074214489999 32.4037557020001, -94.3075504129999 32.4029064600001, -94.30755277 32.402906479, -94.321881945 32.4028226820001)))
I edited your function to return the MultiPolygon as WKT and to check whether there is more than one polygon in the group.
import pandas as pd
from shapely import wkt, geometry
df = pd.DataFrame({
    'FID': [975, 976, 978, 978],
    'Geometry': [
        'POLYGON ((-94.2019149289999 32.171910245, -94.201889847 32.171799917, -94.2019145369999 32.1719110220001, -94.2019344619999 32.171974117, -94.2019149289999 32.171910245))',
        'POLYGON ((-94.204485668 32.175813341, -94.2045721649999 32.1758854190001, -94.2044856639999 32.1758124690001, -94.204358881 32.1757171630001, -94.204485668 32.175813341))',
        'POLYGON ((-94.30755277 32.402906479, -94.321881945 32.4028226820001, -94.321896361 32.4035500580001, -94.3074214489999 32.4037557020001, -94.3075504129999 32.4029064600001, -94.30755277 32.402906479))',
        'POLYGON ((-94.30755277 32.402906479, -94.307552779 32.4005399370001, -94.307558688 32.4005401040001, -94.30755277 32.402906479))',
    ],
})

def to_multipolygon(polygons):
    return (
        geometry.MultiPolygon([wkt.loads(polygon) for polygon in polygons]).wkt
        if len(polygons) > 1
        else polygons.iloc[0]
    )

result = df.groupby('FID')['Geometry'].apply(to_multipolygon)
print(result)
Output
FID
975 POLYGON ((-94.2019149289999 32.171910245, -94....
976 POLYGON ((-94.204485668 32.175813341, -94.2045...
978 MULTIPOLYGON (((-94.30755277 32.402906479, -94...
Name: Geometry, dtype: object
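The wrap-only-when-needed logic is independent of shapely. Here is a minimal pure-pandas sketch of the same groupby-and-combine pattern, with made-up strings standing in for the WKT geometries, to show how each FID collapses to one row:

```python
import pandas as pd

df = pd.DataFrame({
    'FID': [975, 976, 978, 978],
    'Geometry': ['poly_a', 'poly_b', 'poly_c', 'poly_d'],
})

def combine(geoms):
    # Wrap only when the FID group holds more than one geometry,
    # mirroring the to_multipolygon function above
    if len(geoms) > 1:
        return 'MULTI(' + ', '.join(geoms) + ')'
    return geoms.iloc[0]

result = df.groupby('FID')['Geometry'].apply(combine)
print(result.to_dict())
# {975: 'poly_a', 976: 'poly_b', 978: 'MULTI(poly_c, poly_d)'}
```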
I am trying to group by for the following specializations, but I am not getting the expected result (or any, for that matter). The data stays ungrouped even after this step. Any idea what's wrong in my code?
cols_specials = ['Enterprise ID','Specialization','Specialization Branches','Specialization Type']
specials = pd.read_csv(agg_specials, engine='python')
specials = specials.merge(roster, left_on='Enterprise ID', right_on='Enterprise ID', how='left')
specials = specials[cols_specials]
specials = specials.groupby(['Enterprise ID'])['Specialization'].transform(lambda x: '; '.join(str(x)))
specials.to_csv(end_report_specials, index=False, encoding='utf-8-sig')
Please try using agg. (Note also that '; '.join(str(x)) joins the characters of the Series' string representation; you want '; '.join(x).)
import pandas as pd
df = pd.DataFrame(
    [
        ['john', 'eng', 'build'],
        ['john', 'math', 'build'],
        ['kevin', 'math', 'asp'],
        ['nick', 'sci', 'spi']
    ],
    columns=['id', 'spec', 'type']
)
df.groupby(['id'])[['spec']].agg(lambda x: ';'.join(x))
results in:
If you need to preserve the original number of rows, use transform, which returns one column:
df['spec_grouped'] = df.groupby(['id'])[['spec']].transform(lambda x: ';'.join(x))
df
results in:
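Since the outputs above are not shown here, this sketch runs the same agg-versus-transform comparison and prints the results as plain Python objects:

```python
import pandas as pd

df = pd.DataFrame(
    [['john', 'eng', 'build'],
     ['john', 'math', 'build'],
     ['kevin', 'math', 'asp'],
     ['nick', 'sci', 'spi']],
    columns=['id', 'spec', 'type'],
)

# agg collapses each group to a single joined string: one row per id
agg_result = df.groupby('id')['spec'].agg(';'.join)
print(agg_result.to_dict())         # {'john': 'eng;math', 'kevin': 'math', 'nick': 'sci'}

# transform keeps one value per original row, broadcasting the joined string
df['spec_grouped'] = df.groupby('id')['spec'].transform(';'.join)
print(df['spec_grouped'].tolist())  # ['eng;math', 'eng;math', 'math', 'sci']
```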
I have a sample DataFrame:
Date Announcement href
Apr 9, 2020 Hello World https://helloworld.com/
data = {'Date': ['c', 'Apr 8,2010'], 'Announcement': ['Hello World A', 'Hello World B'], 'href': ['https://helloworld.com', 'https://helloworldb.com']}
df = pd.DataFrame(data, columns=['Date', 'Announcement', 'href'])
df.to_excel('announce.xls', engine='xlsxwriter')
I am trying to figure out how I can get output in Excel like the following: the Announcement column should link to the href.
Date Announcement
Apr 9, 2020 Hello World
https://helloworld.com/
Updated to embed the URL in the cell. The trick is to use the *.xlsx format, as opposed to the 1997 *.xls format:
import pandas as pd
data = {
    'Date': ['c', 'Apr 8,2010'],
    'Announcement': [
        '=HYPERLINK("http://helloworld.com", "Hello World A")',
        '=HYPERLINK("http://helloworldb.com", "Hello World B")',
    ],
}
df = pd.DataFrame(data, columns=['Date', 'Announcement'])
df.to_excel('announce.xlsx')
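If you would rather derive the formulas from the original href column than type them by hand, a small list comprehension can build them. A sketch reusing the question's sample data (writing the result with to_excel would then need the xlsx format, as noted above):

```python
import pandas as pd

data = {
    'Date': ['c', 'Apr 8,2010'],
    'Announcement': ['Hello World A', 'Hello World B'],
    'href': ['https://helloworld.com', 'https://helloworldb.com'],
}
df = pd.DataFrame(data)

# Fold each href into an Excel HYPERLINK formula on the Announcement column
df['Announcement'] = [
    f'=HYPERLINK("{url}", "{text}")'
    for text, url in zip(df['Announcement'], df['href'])
]
df = df.drop(columns='href')
print(df['Announcement'].tolist())
```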
I have a problem with the following scenario using PySpark version 2.0: I have a DataFrame with a column containing an array with a start and end value, e.g.
[1000, 1010]
How can I create and compute another column containing an array that holds all the values for the given range? The generated range-values column would be:
+--------------+-------------+-----------------------------+
| Description| Accounts| Range|
+--------------+-------------+-----------------------------+
| Range 1| [101, 105]| [101, 102, 103, 104, 105]|
| Range 2| [200, 203]| [200, 201, 202, 203]|
+--------------+-------------+-----------------------------+
Try this. Define the UDF:
from pyspark.sql import functions as F
from pyspark.sql import types as pt

def range_value(a):
    start = a[0]
    end = a[1] + 1
    return list(range(start, end))

df = spark.createDataFrame([("Range 1", list([101, 105])), ("Range 2", list([200, 203]))], ("Description", "Accounts"))
range_value = F.udf(range_value, pt.ArrayType(pt.IntegerType()))
df = df.withColumn('Range', range_value(F.col('Accounts')))
Output
You should use a UDF (UDF sample).
Consider your pyspark data frame name is df, your data frame could be like this:
df = spark.createDataFrame(
    [("Range 1", list([101, 105])),
     ("Range 2", list([200, 203]))],
    ("Description", "Accounts"))
And your solution is like this:
import pyspark.sql.functions as F
import pyspark.sql.types as T
import numpy as np

def make_range_number(arr):
    number_range = np.arange(arr[0], arr[1] + 1, 1).tolist()
    return number_range

# Declare the return type; without it F.udf defaults to StringType
# and the column would hold the string repr of the list
range_udf = F.udf(make_range_number, T.ArrayType(T.LongType()))
df = df.withColumn("Range", range_udf(F.col("Accounts")))
Have fun! :)
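Spark plumbing aside, the core of both answers is the same plain-Python expansion of [start, end] into an inclusive list, which can be checked without a Spark session:

```python
def range_value(a):
    # Expand [start, end] into the inclusive list of integers
    return list(range(a[0], a[1] + 1))

print(range_value([101, 105]))  # [101, 102, 103, 104, 105]
print(range_value([200, 203]))  # [200, 201, 202, 203]
```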
I am using Python 2.7 and Pandas Profiling to generate a report out of a dataframe. Following is my code:
import pandas as pd
import pandas_profiling
# the actual dataset is very large, just providing the two elements of the list
data = [{'polarity': 0.0, 'name': u'danesh bhopi', 'sentiment': 'Neutral', 'tweet_id': 1049952424818020353, 'original_tweet_id': 1049952424818020353, 'created_at': pd.Timestamp('2018-10-10 14:18:59'), 'tweet_text': u"Wouldn't mind aus 120 all-out but before that would like to see a Finch \U0001f4af #PakVAus #AUSvPAK", 'source': u'Twitter for Android', 'location': u'pune', 'retweet_count': 0, 'geo': '', 'favorite_count': 0, 'screen_name': u'DaneshBhope'}, {'polarity': 1.0, 'name': u'kamal Kishor parihar', 'sentiment': 'Positive', 'tweet_id': 1049952403980775425, 'original_tweet_id': 1049952403980775425, 'created_at': pd.Timestamp('2018-10-10 14:18:54'), 'tweet_text': u'#the_summer_game What you and Australia think\nPlay for\n win \nDraw\n or....! #PakvAus', 'source': u'Twitter for Android', 'location': u'chembur Mumbai ', 'retweet_count': 0, 'geo': '', 'favorite_count': 0, 'screen_name': u'kaluparihar1'}]
df = pd.DataFrame(data) #data is a python list containing python dictionaries
pfr = pandas_profiling.ProfileReport(df)
pfr.to_file("df_report.html")
The screenshot of the part of the df_report.html file is below:
As you can see in the image, the Unique(%) field in all the variables is 0.0 although the columns have unique values.
Apart from this, the chart in the 'location' variable is broken. There are no bars for the values 22, 15, and 4; the only bar shown is for the maximum value. This is happening in all the variables.
Any help would be appreciated.