Plotting web-scraped data with matplotlib

I recently managed to collect tabular data from a PDF file using camelot in Python. By collect I mean print it out on the terminal. Now I would like to automate turning the results into a bar graph in matplotlib. How would I do that? Here's my code for extracting the tabular data from the PDF:
import camelot
tables = camelot.read_pdf("data_table.pdf", pages='2')
print(tables[0].df)
Here's an image of the table (screenshot not reproduced here).
Which then prints out a large table in my terminal:
0 1 2 3 4
0 Country \nCase definition \nCumulative cases \...
1 Guinea Confirmed 2727 156 1683
2 Probable 374 * 374
3 Suspected 7 * ‡
4 Total 3108 156 2057
5 Liberia** Confirmed 3149 11 ‡
6 Probable 1876 * ‡
7 Suspected 3982 * ‡
8 Total 9007 11 3900
9 Sierra Leone Confirmed 8212 230 3042
10 Probable 287 * 208
11 Suspected 2604 * 158
12 Total 11103 230 3408
13 Total 23 218 397 9365
I do have a bit of experience with matplotlib and I know how to plot data manually, but not automatically from the PDF. This would save me some time, since I'm trying to automate the whole process.
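A minimal sketch of the missing step, assuming tables[0].df looks like the printout above: camelot returns every cell as a string, with the header packed into row 0, so promote the header, convert the numeric column, and let pandas' plot.bar (which draws through matplotlib) do the rest. The frame below is a hand-built stand-in for the extracted table, and its column names are guesses read off the printed output:

```python
import pandas as pd
import matplotlib
matplotlib.use("Agg")  # headless backend so this runs without a display
import matplotlib.pyplot as plt

# Stand-in for tables[0].df: camelot yields all-string cells with the
# header in row 0 (column names here are assumptions, not camelot output).
raw = pd.DataFrame([
    ["Country", "Case definition", "Cumulative cases", "New cases", "Deaths"],
    ["Guinea", "Total", "3108", "156", "2057"],
    ["Liberia", "Total", "9007", "11", "3900"],
    ["Sierra Leone", "Total", "11103", "230", "3408"],
])

# Promote row 0 to the header, keep the data rows, and convert the
# numeric column from strings to integers.
df = raw.iloc[1:].copy()
df.columns = raw.iloc[0]
df["Cumulative cases"] = pd.to_numeric(df["Cumulative cases"])

# pandas' plot.bar draws through matplotlib, so the usual API applies.
ax = df.plot.bar(x="Country", y="Cumulative cases", legend=False)
ax.set_ylabel("Cumulative cases")
plt.tight_layout()
plt.savefig("cases.png")
```

In a real run you would replace the hand-built frame with tables[0].df and pick out whichever rows (e.g. the per-country totals) you want to chart.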

Related

Pandas extract hierarchical info?

I have a dataframe which describes serial numbers of items arranged in boxes:
df = pd.DataFrame({'barcode': ['1000']*3 + ['2000']*4 + ['3000']*3, 'box_number': ['10']*2 + ['11'] + ['12']*4 + ['13', '14', '15'], 'serials': list(map(str, range(800, 810)))})
barcode box_number serials
0 1000 10 800
1 1000 10 801
2 1000 11 802
3 2000 12 803
4 2000 12 804
5 2000 12 805
6 2000 12 806
7 3000 13 807
8 3000 14 808
9 3000 15 809
I want to group them hierarchically to output to hierarchical XML, so that every barcode has a list of box numbers which each have list of serials in them.
So I did a groupby which seems to do exactly what I want:
df.groupby(['barcode','box_number'])['serials'].apply(' '.join)
barcode box_number
1000 10 800 801
11 802
2000 12 803 804 805 806
3000 13 807
14 808
15 809
Name: serials, dtype: object
Now, I want to extract this info practically the way it is displayed so that I get a row for each barcode with data grouped similar to this:
row['1000']== {'10': '800 801','11':'802'}
row['2000']== {'12': '803 804 805 806'}
row['3000']== {'13': '807','14':'808','15':'809' }
But I can't seem to figure out how to get this done. I tried reset_index() and another groupby(), but the latter doesn't work on the existing result since it is a Series, and I can't work out the right approach.
How should I do this most concisely? I looked over the questions here but didn't find a similar issue.
Use a dictionary comprehension to build the nested dictionary, with Series.xs and Series.to_dict:
s = df.groupby(['barcode','box_number'])['serials'].apply(' '.join)
d = {lev: s.xs(lev).to_dict() for lev in s.index.levels[0]}
print(d)
{'1000': {'10': '800 801', '11': '802'},
'2000': {'12': '803 804 805 806'},
'3000': {'13': '807', '14': '808', '15': '809'}}
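Since the end goal is hierarchical XML, the nested dict converts straightforwardly with the standard library. A sketch, where the element and attribute names are assumptions about the target schema:

```python
import pandas as pd
import xml.etree.ElementTree as ET

df = pd.DataFrame({
    'barcode': ['1000']*3 + ['2000']*4 + ['3000']*3,
    'box_number': ['10']*2 + ['11'] + ['12']*4 + ['13', '14', '15'],
    'serials': list(map(str, range(800, 810))),
})

# Nested dict as in the answer above: {barcode: {box_number: serials}}.
s = df.groupby(['barcode', 'box_number'])['serials'].apply(' '.join)
d = {lev: s.xs(lev).to_dict() for lev in s.index.levels[0]}

# Emit one <barcode> per top-level key, one <box> per box number
# (tag/attribute names are hypothetical, not from the question).
root = ET.Element('barcodes')
for barcode, boxes in d.items():
    b = ET.SubElement(root, 'barcode', id=barcode)
    for box, serials in boxes.items():
        ET.SubElement(b, 'box', number=box).text = serials

print(ET.tostring(root, encoding='unicode'))
```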

Scraping forum using BeautifulSoup and display in a tabular form

How do I code BeautifulSoup to display the results in a tabular format?
something like this:
Topic | Views | Replies
---------------------------------------
XPS 7590 problems | 557 | 8
SSD not working | 76 | 3
My code is:
import requests, re
from bs4 import BeautifulSoup
import pandas as pd
r = requests.get("https://www.dell.com/community/XPS/bd-p/XPS")
soup = BeautifulSoup(r.content)
g_data = soup.find_all("div", {"class": "lia-component-messages-column-thread-info"})
for item in g_data:
    print(item.find_all("h2", {"class": "message-subject"})[0].text)
    print(item.find_all("span", {"class": "lia-message-stats-count"})[0].text)  # replies
    print(item.find_all("span", {"class": "lia-message-stats-count"})[1].text)  # views
Collect each "row" into a list and construct the DataFrame once at the end (DataFrame.append was deprecated and then removed in pandas 2.0, and appending inside a loop rebuilds the frame on every iteration anyway):
import requests
from bs4 import BeautifulSoup
import pandas as pd
r = requests.get("https://www.dell.com/community/XPS/bd-p/XPS")
soup = BeautifulSoup(r.content, "html.parser")
g_data = soup.find_all("div", {"class": "lia-component-messages-column-thread-info"})
rows = []
for item in g_data:
    topic = item.find_all("h2", {"class": "message-subject"})[0].text.strip()
    replies = item.find_all("span", {"class": "lia-message-stats-count"})[0].text.strip()  # replies
    views = item.find_all("span", {"class": "lia-message-stats-count"})[1].text.strip()  # views
    rows.append([topic, views, replies])
df = pd.DataFrame(rows, columns=['Topic', 'Views', 'Replies'])
Output:
print (df)
Topic Views Replies
0 FAQ Modern Standby 1057 0
1 FAQ XPS Laptops 4315 0
2 Where is the Precision Laptops Forum board? 624 0
3 XPS 15-9570, color banding issue 5880 192
4 XPS 7590 problems.. 565 9
5 XPS 13 7390 2-in-1 Display and Touchscreen issues 17 2
6 Dell XPS 9570 I7-8750H video display issues 9 0
7 XPS 9360 Fn lock for PgUp PgDn 12 0
8 Dell XPS DPC Latency Fix 1724 4
9 XPS 13 7390 2-in-1, Realtek drivers lead to fr... 253 11
10 XPS 12 9q23 Touch screen firmware update fix 36 1
11 Dell XPS 15 9570 when HDMI plugged in, screen ... 17 0
12 XPS 13 7390 2 in 1 bluetooth keyboard and mous... 259 10
13 xps15 7590 wifi problem 46 1
14 Unable to update Windows from 1803 to 1909 - X... 52 5
15 Dell XPS 9300 - Thunderbolt 3 Power Delivery I... 28 0
16 Dell XPS 15 9560, right arrow key or right of ... 26 0
17 XPS 13 2020 (9300) Ubuntu sudden shut down 24 0
18 Dell XPS 15 9750 won’t login 26 0
19 XPS 13 9360 Windows Hello Face - reconfigurati... 29 2
20 Enclosure for Dell XPS 13 9360 512 GB pcie nvm... 181 7
21 XPS 13 7390 Firmware 1.3.1 Issue - Bluetooth /... 119 2
22 SSD Onboard? 77 3
23 XPS 13 9350 only turns on when charger connected 4090 11
24 Integrated webcam not working 45 1
25 Docking station for XPS 15 9570, Dell TB16 not... 53 4
26 Dell XPS 13 9370 34 1
27 XPS 13 9380 overheat while charging 602 3
28 DELL XPS 13 (9300) REALTEK AUDIO DRIVER PROBLEM 214 2
29 XPS 15 9570 freezing Windows 10 222 6
30 XPS 13 (9300) - Speaker Vibration 40 2
31 Dell XPS 15 9570 Fingerprint reader not workin... 158 2
32 XPS 9570 Intel 9260 No Bluetooth 34 0

Generate Seaborn Countplot using column value as count

For the following table
                              count_value
CPUCore Offline_RetentionAge
i7      183                          4184
        7                            1981
        30                            471
i5      183                          2327
        7                             831
        30                            250
Pentium 183                           333
        7                             125
        30                             43
2       183                           575
        7                             236
        31                             96
Is it possible to generate a seaborn countplot (or a normal countplot) like the following, which was generated using sns.countplot(x='CPUCore', hue="Offline_BackupSchemaIncrementType", data=dfCombined_df)?
The problem here is that I need to use count_value as the count, rather than actually counting Offline_RetentionAge.
I think you need seaborn.barplot, which takes the precomputed value as the bar height; after reset_index() the index levels become ordinary columns:
sns.barplot(x='CPUCore', y='count_value', hue='Offline_RetentionAge', data=df.reset_index())
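A runnable sketch of that approach, rebuilding the aggregated table from the question as a flat frame (so no reset_index is needed); the values are copied from the table above:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Flat version of the aggregated table from the question.
df = pd.DataFrame({
    'CPUCore': ['i7']*3 + ['i5']*3 + ['Pentium']*3 + ['2']*3,
    'Offline_RetentionAge': [183, 7, 30]*3 + [183, 7, 31],
    'count_value': [4184, 1981, 471, 2327, 831, 250, 333, 125, 43, 575, 236, 96],
})

# barplot uses the precomputed count as the bar height, which mimics
# what countplot would show if the raw rows were available.
ax = sns.barplot(x='CPUCore', y='count_value', hue='Offline_RetentionAge', data=df)
plt.savefig("counts.png")
```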

Pandas: Error when merging two tables, Error with set_index

Thanks in advance for your help, here's my question:
I've successfully loaded my df into an IPython notebook and then ran a groupby on it:
station_count = station.groupby('landmark').count()
which produced a table like this (screenshot not reproduced here).
Now I'm trying to merge it with another table:
dock_count_by_station = station.groupby('landmark').sum()
that is also a simple group by on the same table, but the merge produces an error:
TypeError: cannot concatenate a non-NDFrame object
with this code:
dock_count_by_station.merge(station_count)
I think the problem is that I need to set the index of the two tables before merging them but I keep getting this error for the code below:
pandas/index.pyx in pandas.index.IndexEngine.get_loc (pandas/index.c:3979)()
pandas/index.pyx in pandas.index.IndexEngine.get_loc (pandas/index.c:3843)()
pandas/hashtable.pyx in pandas.hashtable.PyObjectHashTable.get_item (pandas/hashtable.c:12265)()
pandas/hashtable.pyx in pandas.hashtable.PyObjectHashTable.get_item (pandas/hashtable.c:12216)()
KeyError: 'landmark'
station_count.set_index('landmark')
Using join
You can use join, which merges the tables on their index. You may also wish to specify the join type (e.g. 'outer', 'inner', 'left' or 'right'). You have overlapping column names (e.g. station_id), so you need to specify a suffix.
>>> dock_count_by_station.join(station_count, rsuffix='_rhs')
dockcount lat long station_id dockcount_rhs installation lat_rhs long_rhs name station_id_rhs
landmark
Mountain View 117 261.767433 -854.623012 210 7 7 7 7 7 7
Palo Alto 75 187.191873 -610.767939 180 5 5 5 5 5 5
Redwood City 115 262.406232 -855.602755 224 7 7 7 7 7 7
San Francisco 665 1322.569239 -4284.054814 2126 35 35 35 35 35 35
San Jose 249 560.039892 -1828.370075 200 15 15 15 15 15 15
Using merge
Note that your landmark index was set by default when you did the groupby. You can always use as_index=False if you don't want this to occur, but then you would have to use merge instead of join.
dock_count_by_station = station.groupby('landmark', as_index=False).sum()
station_count = station.groupby('landmark', as_index=False).count()
>>> dock_count_by_station.merge(station_count, on='landmark', suffixes=['_lhs', '_rhs'])
landmark dockcount_lhs lat_lhs long_lhs station_id_lhs dockcount_rhs installation lat_rhs long_rhs name station_id_rhs
0 Mountain View 117 261.767433 -854.623012 210 7 7 7 7 7 7
1 Palo Alto 75 187.191873 -610.767939 180 5 5 5 5 5 5
2 Redwood City 115 262.406232 -855.602755 224 7 7 7 7 7 7
3 San Francisco 665 1322.569239 -4284.054814 2126 35 35 35 35 35 35
4 San Jose 249 560.039892 -1828.370075 200 15 15 15 15 15 15
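To see the failure and the fix in one place, here is a self-contained sketch with a toy stand-in for station (column names follow the question); it shows why set_index('landmark') raises KeyError and how join lines the two groupby results up on their shared index:

```python
import pandas as pd

# Toy stand-in for the 'station' table (names follow the question,
# values are made up for illustration).
station = pd.DataFrame({
    'landmark': ['San Jose', 'San Jose', 'Palo Alto'],
    'station_id': [2, 3, 4],
    'dockcount': [27, 15, 11],
})

# groupby moves 'landmark' into the index, so a later
# set_index('landmark') raises KeyError -- it is no longer a *column*.
dock_count_by_station = station.groupby('landmark').sum()
station_count = station.groupby('landmark').count()

# join merges on the shared index; rsuffix disambiguates the
# overlapping column names.
merged = dock_count_by_station.join(station_count, rsuffix='_rhs')
print(merged)
```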

SQL: How to remove/update records in a table, syncing with (Q)List data, in order to decrease the burden on the database table?

I am using Qt and MS-Sql Server on Windows7 OS.
What I have is an MS-SQL database that I use to store data/info coming from equipment that is mounted in some vehicles.
There is a table in the database named TransactionFilesInfo - a table used to store information about transaction files from the equipment, when they connect to the TCP-server.
We are using this table as we are requested to avoid duplicate files. It happens (sometimes) when the remote equipment does NOT delete the transaction files after they are sent to the server. Hence, I use the info from the table to check [size and CRC] to avoid downloading duplicates.
Some sample data for TransactionFilesInfo table looks like this:
[TransactionFilesInfo]:
DeviceID FileNo FileSequence FileSize FileCRC RecordTimeStamp
10203 2 33 230 55384 2015-11-26 14:54:15
10203 7 33 624 55391 2015-11-26 14:54:15
10203 2 34 146 21505 2015-11-26 14:54:16
10203 7 34 312 35269 2015-11-26 14:54:16
10203 2 35 206 23022 2015-11-26 15:33:22
10203 7 35 208 11091 2015-11-26 15:33:22
10203 2 36 134 34918 2015-11-26 15:55:44
10203 7 36 104 63865 2015-11-26 15:55:44
10203 2 37 140 35466 2015-11-26 16:20:38
10203 7 37 208 62907 2015-11-26 16:20:38
10203 2 38 134 17706 2015-11-26 16:38:33
10203 7 38 104 42358 2015-11-26 16:38:33
11511 2 21 194 29913 2015-12-02 16:22:59
11511 7 21 114 30038 2015-12-02 16:22:59
On the other hand, every time a device connects to the server, it first sends a list of file information. The Qt application takes care of that.
The list contains elements like this:
struct FileInfo
{
    unsigned short FileNumber;
    unsigned short FileSequence;
    unsigned short FileCRC;
    unsigned long  FileSize;
};
So, as an example (inspired by the table above) the connected device (DeviceID=10203) may say that it has the following files:
QList<FileInfo> filesList;
// here is the log4qt output...
filesList[0] --> FileNo=2 FileSeq=33 FileSize=230 and FileCRC=55384
filesList[1] --> FileNo=2 FileSeq=34 FileSize=146 and FileCRC=21505
filesList[2] --> FileNo=7 FileSeq=33 FileSize=624 and FileCRC=55391
filesList[3] --> FileNo=7 FileSeq=34 FileSize=312 and FileCRC=35269 ...
Well, what I need is a method to remove/delete, for a given DeviceID, all the records in the TransactionFilesInfo table, records that are NOT in the list sent by the remote device. Hence, I will be able to decrease the burden (size) on the database table.
Remark: For the moment I just delete (#midnight) all the records that are older than let's say 10 days, based on RecordTimeStamp field. So, the size of the table doesn't increase over an alarming level :)
Finally, to clarify it a little bit: I would mainly need help with SQL. Yet, I would not refuse any idea on how to do some related things/tricks on the Qt side ;)
The SQL to delete those records might look something like this:
DELETE FROM TransactionFilesInfo
WHERE DeviceID = 10203
  AND 'File' + CONVERT(varchar(11), FileNo) + '_' +
      RIGHT('000' + CONVERT(varchar(11), FileSequence), 3)
      NOT IN ('File2_033', 'File2_034', 'File7_033', 'File7_034', ...)
If you wanted to delete all of them for a device, you could drop the code that looks at the FileNo and FileSequence so it is simply:
DELETE FROM TransactionFilesInfo
WHERE DeviceID = 10203
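The same delete-everything-not-reported pattern can also be driven from the application side with bound parameters built from the files list. A sketch against an in-memory SQLite stand-in (the table is trimmed to the key columns, and SQLite's || replaces T-SQL's + concatenation):

```python
import sqlite3

# In-memory stand-in for TransactionFilesInfo, trimmed to the key columns.
con = sqlite3.connect(":memory:")
con.execute("""CREATE TABLE TransactionFilesInfo
               (DeviceID INTEGER, FileNo INTEGER, FileSequence INTEGER)""")
con.executemany("INSERT INTO TransactionFilesInfo VALUES (?, ?, ?)",
                [(10203, 2, 33), (10203, 2, 34), (10203, 7, 33), (10203, 2, 99)])

# (FileNo, FileSequence) pairs reported by the device, as in filesList.
reported = [(2, 33), (2, 34), (7, 33)]

# Mirror the answer's composite-key trick, but with one bound placeholder
# per reported pair so the statement stays injection-safe.
keys = ["%d_%d" % pair for pair in reported]
placeholders = ", ".join("?" * len(keys))
con.execute("DELETE FROM TransactionFilesInfo "
            "WHERE DeviceID = ? "
            "AND (FileNo || '_' || FileSequence) NOT IN (%s)" % placeholders,
            [10203] + keys)
```

After the delete, only the rows the device still reports remain; the stale (2, 99) row is gone.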