missing data in pandas profiling report - pandas

I am using Python 2.7 and Pandas Profiling to generate a report out of a dataframe. Following is my code:
import pandas as pd
from pandas import Timestamp  # needed for the Timestamp(...) literals in data
import pandas_profiling
# the actual dataset is very large, just providing the two elements of the list
data = [{'polarity': 0.0, 'name': u'danesh bhopi', 'sentiment': 'Neutral', 'tweet_id': 1049952424818020353, 'original_tweet_id': 1049952424818020353, 'created_at': Timestamp('2018-10-10 14:18:59'), 'tweet_text': u"Wouldn't mind aus 120 all-out but before that would like to see a Finch \U0001f4af #PakVAus #AUSvPAK", 'source': u'Twitter for Android', 'location': u'pune', 'retweet_count': 0, 'geo': '', 'favorite_count': 0, 'screen_name': u'DaneshBhope'}, {'polarity': 1.0, 'name': u'kamal Kishor parihar', 'sentiment': 'Positive', 'tweet_id': 1049952403980775425, 'original_tweet_id': 1049952403980775425, 'created_at': Timestamp('2018-10-10 14:18:54'), 'tweet_text': u'#the_summer_game What you and Australia think\nPlay for\n win \nDraw\n or....! #PakvAus', 'source': u'Twitter for Android', 'location': u'chembur Mumbai ', 'retweet_count': 0, 'geo': '', 'favorite_count': 0, 'screen_name': u'kaluparihar1'}]
df = pd.DataFrame(data) #data is a python list containing python dictionaries
pfr = pandas_profiling.ProfileReport(df)
pfr.to_file("df_report.html")
The screenshot of the part of the df_report.html file is below:
As you can see in the image, the Unique(%) field in all the variables is 0.0 although the columns have unique values.
Apart from this, the chart in the 'location' variable is broken. There are no bars for the values 22, 15 and 4; only the maximum value gets a bar. This happens for all the variables.
Any help would be appreciated.
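One way to tell whether the report or the data is at fault is to compute the same statistics directly in pandas and compare them with what the report shows. The following is a minimal sketch on a made-up four-row frame (the rows and the two columns kept are assumptions, not the real dataset):

```python
import pandas as pd

# Hypothetical miniature of the tweet data; only two columns are kept.
df = pd.DataFrame({
    "screen_name": ["DaneshBhope", "kaluparihar1", "DaneshBhope", "someone"],
    "sentiment": ["Neutral", "Positive", "Neutral", "Neutral"],
})

# Distinct values per column, then as a percentage of the row count.
# Multiplying by 100.0 keeps the arithmetic in floats.
unique_counts = df.nunique()
unique_pct = unique_counts * 100.0 / len(df)
print(unique_pct)

# Frequency table of the kind that sits behind a categorical bar chart.
sentiment_counts = df["sentiment"].value_counts()
print(sentiment_counts)
```

If these numbers look right while the report still shows 0.0, the problem is more likely in the installed pandas-profiling version than in the dataframe.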

Related

python complex list object to dataframe

I wanted to create a dataframe by expanding the child list object along with parent objects.
Obviously trying pd.DataFrame(lst) does not work as it creates data frame with three columns only and keeps the child object as one column.
Is it possible to do this in one line instead of iterating through list to expand each child object? Thank you in advance.
I have a list object in python like this:
lst = [
{
'id': 'rec1',
'fields': {
'iso': 'US',
'name': 'U S',
'lat': '38.9051',
'lon': '-77.0162'
},
'createdTime': '2021-03-16T13:03:24.000Z'
},
{
'id': 'rec2',
'fields': {'iso': 'HK', 'name': 'China', 'lat': '0.0', 'lon': '0.0'},
'createdTime': '2021-03-16T13:03:24.000Z'
}
]
Expected dataframe:
Use json_normalize:
df = pd.json_normalize(lst)
print(df)
     id               createdTime fields.iso fields.name fields.lat fields.lon
0  rec1  2021-03-16T13:03:24.000Z         US         U S    38.9051   -77.0162
1  rec2  2021-03-16T13:03:24.000Z         HK       China        0.0        0.0
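If the "fields." prefix is not wanted, or lat/lon should be numeric, a short follow-up step can handle both. This is a sketch on the same list; stripping the prefix assumes the nested keys do not collide with the top-level ones:

```python
import pandas as pd

lst = [
    {"id": "rec1",
     "fields": {"iso": "US", "name": "U S", "lat": "38.9051", "lon": "-77.0162"},
     "createdTime": "2021-03-16T13:03:24.000Z"},
    {"id": "rec2",
     "fields": {"iso": "HK", "name": "China", "lat": "0.0", "lon": "0.0"},
     "createdTime": "2021-03-16T13:03:24.000Z"},
]

df = pd.json_normalize(lst)

# Drop the "fields." prefix that json_normalize adds to nested keys.
df.columns = [c.split(".", 1)[-1] for c in df.columns]

# lat/lon arrive as strings; convert them to floats.
df[["lat", "lon"]] = df[["lat", "lon"]].astype(float)
print(df.dtypes)
```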

folium choropleth returns blank

I have a GeoDataFrame that plots nicely from Geopandas, but returns blank as Choropleth graph in Folium.
Folium 0.7.0
Geopandas 0.5.0
import matplotlib.pyplot as plt
plt.rcParams['figure.figsize'] = [12, 12]
radiosambab.plot('situacionpromedio', antialiased=False)
As a geojson
radiosambab.__geo_interface__
returns
{'type': 'FeatureCollection',
'features': [{'id': '020130302',
'type': 'Feature',
'properties': {'situacionpromedio': 1.1173449839705998},
'geometry': {'type': 'Polygon',
'coordinates': (((-58.46738862003677, -34.53484761336359),
(-58.466080612615286, -34.53427219003239),
(-58.46379657486779, -34.53326322986549),
(-58.46165233386257, -34.530802575280035),
(-58.46133757821172, -34.530441540420355),
(-58.4588949370924, -34.527620828300144),
(-58.45884013885469, -34.52762641175383),
(-58.45875915687486, -34.527621382400326),
(-58.458732162886044, -34.52761970593736),
(-58.45867655438868, -34.52763422563783),
(-58.45856182767256, -34.52767203362345),
(-58.45850001004012, -34.52769515425145),
(-58.458440891778, -34.52771844249678),
(-58.45839257108904, -34.52774240132773),
(-58.45834357673059, -34.5277438516926),
...
Calling
radiosambab['situacionpromedio']
returns a Series as expected:
COD_2010
020130302 1.117345
020131101 1.117371
020130104 1.161630
020130102 1.087263
020130101 1.268362
020120405 1.132843
020130107 1.085900
020130106 1.028195
020130109 1.056225
020130111 1.061627
020120407 1.138702
020120404 1.084368
020120402 1.078862
...
But, when invoking folium.Choropleth, it does not work:
import folium
m_2 = folium.Map(location=[-34.603722, -58.381592], tiles='openstreetmap', zoom_start=14)
folium.Choropleth(geo_data=radiosambab.__geo_interface__, data=radiosambab['situacionpromedio'], key_on='feature.id', fill_color='YlOrBr').add_to(m_2)
folium.LayerControl().add_to(m_2)
m_2
Returns
Thanks!
The problem seems to be related to a lack of memory. The map does render when I restrict the number of polygons, but it fails above approximately 2000 polygons.
I had a similar problem and solved it with:
var_geodataframe = var_geodataframe.to_crs(epsg=4326)
You may also need to use the current version of folium and check your Geopandas version.
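GeoJSON coordinates are expected to be longitude/latitude in degrees, which is presumably why reprojecting to EPSG:4326 helped. Before calling Choropleth, two cheap checks can narrow things down: do the feature ids match the index of the data Series, and do the coordinates look like degrees? The sketch below uses a made-up two-feature collection (ids "A" and "B" and all coordinates are hypothetical) and only handles plain Polygon geometries:

```python
import pandas as pd

# Hypothetical collection shaped like radiosambab.__geo_interface__.
geo = {
    "type": "FeatureCollection",
    "features": [
        {"id": "A", "type": "Feature",
         "properties": {"situacionpromedio": 1.12},
         "geometry": {"type": "Polygon",
                      "coordinates": (((-58.467, -34.534), (-58.466, -34.534),
                                       (-58.463, -34.533), (-58.467, -34.534)),)}},
        # Feature "B" deliberately uses projected-looking values (metres).
        {"id": "B", "type": "Feature",
         "properties": {"situacionpromedio": 1.09},
         "geometry": {"type": "Polygon",
                      "coordinates": (((5640000.0, 6170000.0), (5640100.0, 6170000.0),
                                       (5640100.0, 6170100.0), (5640000.0, 6170000.0)),)}},
    ],
}
data = pd.Series({"A": 1.12, "B": 1.09})

# Check 1: every feature id should be present in the data index.
feature_ids = {f["id"] for f in geo["features"]}
missing = feature_ids - set(data.index)

# Check 2: the exterior ring should hold lon/lat values in degrees.
def looks_like_lon_lat(feature):
    ring = feature["geometry"]["coordinates"][0]
    return all(-180 <= x <= 180 and -90 <= y <= 90 for x, y in ring)

bad_crs = [f["id"] for f in geo["features"] if not looks_like_lon_lat(f)]
print(missing, bad_crs)
```

A non-empty `bad_crs` suggests the GeoDataFrame still needs `to_crs(epsg=4326)`; a non-empty `missing` points at a `key_on` mismatch instead.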

Grouping and heading pandas dataframe

I have the following dataframe of securities, with a computed 'liquidity score' in the last column, where 1 = liquid, 2 = less liquid, and 3 = illiquid. I want to group the securities (dynamically) by their liquidity. Is there a way to group them and include some kind of header for each group? How can this best be achieved? Below is the code and an example of how it is supposed to look.
import pandas as pd
df = pd.DataFrame({'ID':['XS123', 'US3312', 'DE405'], 'Currency':['EUR', 'EUR', 'USD'], 'Liquidity score':[2,3,1]})
df = df.sort_values(by=["Liquidity score"])
print(df)
# 1 = liquid, 2 = less liquid, 3 = illiquid
Add labels for liquidity score
The following replaces labels for numbers in Liquidity score:
df['grp'] = df['Liquidity score'].replace({1:'Liquid', 2:'Less liquid', 3:'Illiquid'})
Headers for each group
As per your comment, find below a solution to do this.
Let's illustrate this with a small data example.
df = pd.DataFrame({'ID':['XS223', 'US934', 'US905', 'XS224', 'XS223'], 'Currency':['EUR', 'USD', 'USD','EUR','EUR',]})
Insert a header on specific rows using np.insert.
import numpy as np
df = pd.DataFrame(np.insert(df.values, 0, values=["Liquid", ""], axis=0))
df = pd.DataFrame(np.insert(df.values, 2, values=["Less liquid", ""], axis=0))
df.columns = ['ID', 'Currency']
Using Pandas styler, we can add a background color, change font weight to bold and align the text to the left.
# Styler.hide(axis="index") replaces hide_index(), which was removed in pandas 2.0
df.style.hide(axis="index").set_properties(subset=pd.IndexSlice[[0, 2], :], **{'font-weight': 'bold', 'background-color': 'lightblue', 'text-align': 'left'})
You can add a new column like this:
df['group'] = np.select(
[
df['Liquidity score'].eq(1),
df['Liquidity score'].eq(2)
],
[
'Liquid','Less liquid'
],
default='Illiquid'
)
Then set it as the index, so you can filter by group using the index:
df.set_index(['group','ID'], inplace=True)
df.loc['Less liquid',:]
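The header rows can also be generated dynamically instead of being inserted at hard-coded positions. This is a sketch using groupby on the question's data; the `labels` mapping and the `report` name are illustrative choices, not part of the answers above:

```python
import pandas as pd

df = pd.DataFrame({
    "ID": ["XS123", "US3312", "DE405"],
    "Currency": ["EUR", "EUR", "USD"],
    "Liquidity score": [2, 3, 1],
})
labels = {1: "Liquid", 2: "Less liquid", 3: "Illiquid"}

blocks = []
# groupby sorts by the key, so groups come out in score order 1, 2, 3.
for score, group in df.groupby("Liquidity score"):
    # One header row per group, followed by that group's securities.
    header = pd.DataFrame({"ID": [labels[score]], "Currency": [""]})
    blocks.append(header)
    blocks.append(group[["ID", "Currency"]])

report = pd.concat(blocks, ignore_index=True)
print(report)
```

The header rows sit wherever `report["ID"]` is one of the label strings, so they can be passed to the Styler via `pd.IndexSlice` in the same way as the fixed `[0, 2]` positions.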

Pandas Dataframe from a nested dictionary with list as values

I'm new to Python and pandas and I can't figure out a way to push this dict into a dataframe:
a_dict = {'position': [{'points': '57.95', 'name': 'Def'}, {'points': '121', 'name': 'PK'}, {'points': '383.1', 'name': 'RB'}, {'points': '299.96', 'name': 'QB'}, {'points': '177.8', 'name': 'TE'}, {'points': '616.42', 'name': 'WR'}], 'id': 'MIN'}
I have tried multiple FOR loops to iterate through the dict but the list keeps me from organizing it. The data is originally in a JSON format. Thank you!
I'm guessing you want the points and names as columns
points = []
name = []
for dct in a_dict['position']:
    points.append(dct['points'])
    name.append(dct['name'])
pd.DataFrame({'points':points,'name':name})
With the output
points name
0 57.95 Def
1 121 PK
2 383.1 RB
3 299.96 QB
4 177.8 TE
5 616.42 WR
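The same result can probably be had without the explicit loop. A sketch using `pd.json_normalize` with `record_path` and `meta`, which also carries the team id onto every row; converting points to numbers is an extra step assumed to be wanted:

```python
import pandas as pd

a_dict = {
    "position": [
        {"points": "57.95", "name": "Def"},
        {"points": "121", "name": "PK"},
        {"points": "383.1", "name": "RB"},
        {"points": "299.96", "name": "QB"},
        {"points": "177.8", "name": "TE"},
        {"points": "616.42", "name": "WR"},
    ],
    "id": "MIN",
}

# One row per entry in the 'position' list, with 'id' repeated on each row.
df = pd.json_normalize(a_dict, record_path="position", meta=["id"])

# The points are strings in the source data; make them numeric.
df["points"] = pd.to_numeric(df["points"])
print(df)
```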

pandas xlsxwriter stacked barchart

I want to create a grouped bar chart in Excel with xlsxwriter, but I can't seem to find a way to do so.
Here is my code:
bar_chart2 = workbook.add_chart({'type':'column'})
bar_chart2.add_series({
'name':'Month over month product',
'categories':'=Month over month!$H$2:$H$6',
'values':'=Month over month!$I$2:$J$6',
})
bar_chart2.set_legend({'none': True})
worksheet5.insert_chart('F8',bar_chart2)
However, I get that.
Using your provided data, I reworked the example given in the XlsxWriter docs by jmcnamara to suit what you're looking for.
Full Code:
import pandas as pd
import xlsxwriter
headings = [' ', 'Apr 2017', 'May 2017']
data = [
['NGN', 'UGX', 'KES', 'TZS', 'CNY'],
[5816, 1121, 115, 146, 1],
[7089, 1095, 226, 120, 0],
]
#opening workbook
workbook = xlsxwriter.Workbook("test.xlsx")
worksheet5 = workbook.add_worksheet('Month over month')
worksheet5.write_row('H1', headings)
worksheet5.write_column('H2', data[0])
worksheet5.write_column('I2', data[1])
worksheet5.write_column('J2', data[2])
# beginning of OP snippet
bar_chart2 = workbook.add_chart({'type':'column'})
bar_chart2.add_series({
'name': "='Month over month'!$I$1",
'categories': "='Month over month'!$H$2:$H$6",
'values': "='Month over month'!$I$2:$I$6",
})
bar_chart2.add_series({
'name': "='Month over month'!$J$1",
'categories': "='Month over month'!$H$2:$H$6",
'values': "='Month over month'!$J$2:$J$6",
})
bar_chart2.set_title({'name': 'Month over month product'})
bar_chart2.set_style(11)
# I took the liberty of keeping the legend; the original snippet disabled it
#bar_chart2.set_legend({'none': True})
# end of OP snippet
worksheet5.insert_chart('F8', bar_chart2)
workbook.close()
Output:
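The code above gives grouped (side-by-side) columns. Since the title asks about a stacked chart, here is a sketch of the stacked variant, which only differs in passing 'subtype': 'stacked' to add_chart. The temporary output path and the loop over columns are illustrative choices:

```python
import os
import tempfile

import xlsxwriter

# Write to a temporary folder so the sketch does not clobber test.xlsx.
path = os.path.join(tempfile.mkdtemp(), "stacked.xlsx")
workbook = xlsxwriter.Workbook(path)
worksheet = workbook.add_worksheet("Month over month")

worksheet.write_row("H1", [" ", "Apr 2017", "May 2017"])
worksheet.write_column("H2", ["NGN", "UGX", "KES", "TZS", "CNY"])
worksheet.write_column("I2", [5816, 1121, 115, 146, 1])
worksheet.write_column("J2", [7089, 1095, 226, 120, 0])

# 'subtype': 'stacked' piles the series on top of each other per category.
chart = workbook.add_chart({"type": "column", "subtype": "stacked"})
for col in ("I", "J"):
    chart.add_series({
        "name": "='Month over month'!${0}$1".format(col),
        "categories": "='Month over month'!$H$2:$H$6",
        "values": "='Month over month'!${0}$2:${0}$6".format(col),
    })
chart.set_title({"name": "Month over month product"})

worksheet.insert_chart("F8", chart)
workbook.close()
```

As in the grouped version, each month needs its own add_series call with a single-column 'values' range; a range spanning $I$2:$J$6 in one series is what the original snippet got wrong.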