Load a .txt file with unstructured text in Python - pandas

I have found a .txt file with the names of more than 5000 cities around the world. The link is here. The text within is all messy. I would like, in Python, to read the file and store the names in a list, so I can look up a city whenever I want.
I tried loading it as a dataframe with
import pandas as pd
cities = pd.read_csv('cities15000.txt',error_bad_lines=False)
However, everything looks very messy.
Is there an easier way to achieve this?
Thanks in advance!

The linked file is like a CSV (Comma-Separated Values) file, but instead of commas it uses tabs as the field separator. Set the sep parameter of pd.read_csv to '\t', i.e. the tab character.
In [18]: import pandas as pd
...:
...: pd.read_csv('cities15000.txt', sep = '\t', header = None)
Out[18]:
0 1 2 3 4 5 ... 13 14 15 16 17 18
0 3040051 les Escaldes les Escaldes Ehskal'des-Ehndzhordani,Escaldes,Escaldes-Engo... 42.50729 1.53414 ... NaN 15853 NaN 1033 Europe/Andorra 2008-10-15
1 3041563 Andorra la Vella Andorra la Vella ALV,Ando-la-Vyey,Andora,Andora la Vela,Andora ... 42.50779 1.52109 ... NaN 20430 NaN 1037 Europe/Andorra 2020-03-03
2 290594 Umm Al Quwain City Umm Al Quwain City Oumm al Qaiwain,Oumm al Qaïwaïn,Um al Kawain,U... 25.56473 55.55517 ... NaN 62747 NaN 2 Asia/Dubai 2019-10-24
3 291074 Ras Al Khaimah City Ras Al Khaimah City Julfa,Khaimah,RAK City,RKT,Ra's al Khaymah,Ra'... 25.78953 55.94320 ... NaN 351943 NaN 2 Asia/Dubai 2019-09-09
4 291580 Zayed City Zayed City Bid' Zayed,Bid’ Zayed,Madinat Za'id,Madinat Za... 23.65416 53.70522 ... NaN 63482 NaN 124 Asia/Dubai 2019-10-24
... ... ... ... ... ... ... ... ... ... .. ... ... ...
24563 894701 Bulawayo Bulawayo BUQ,Bulavajas,Bulavajo,Bulavejo,Bulawayo,bu la... -20.15000 28.58333 ... NaN 699385 NaN 1348 Africa/Harare 2019-09-05
24564 895061 Bindura Bindura Bindura,Bindura Town,Kimberley Reefs,Биндура -17.30192 31.33056 ... NaN 37423 NaN 1118 Africa/Harare 2010-08-03
24565 895269 Beitbridge Beitbridge Bajtbridz,Bajtbridzh,Beitbridge,Beitbridzas,Be... -22.21667 30.00000 ... NaN 26459 NaN 461 Africa/Harare 2013-03-12
24566 1085510 Epworth Epworth Epworth -17.89000 31.14750 ... NaN 123250 NaN 1508 Africa/Harare 2012-01-19
24567 1106542 Chitungwiza Chitungwiza Chitungviza,Chitungwiza,Chytungviza,Citungviza... -18.01274 31.07555 ... NaN 340360 NaN 1435 Africa/Harare 2019-09-05
[24568 rows x 19 columns]
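If the end goal is just a searchable list of city names, one column of the parsed frame is enough. Here is a minimal sketch, assuming the GeoNames cities15000.txt layout shown above, where column 1 holds the city name (column 2 is the ASCII variant):
import pandas as pd

# Read the tab-separated file; the GeoNames dump has no header row
cities = pd.read_csv('cities15000.txt', sep='\t', header=None)

# Column 1 is the city name, as visible in the output above
city_names = cities[1].tolist()

print('Bulawayo' in city_names)                           # exact lookup
print([c for c in city_names if 'andorra' in c.lower()])  # substring search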


Can someone help me finish this code by saving the readable dataframe as a CSV file? I could not save it

I am trying to save the data in the dataframe as a CSV file, but I always get an error. I need the last bit of code to convert the readable data content in Python into a saved CSV file.
Code is here:
from selenium import webdriver
import time
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
import pandas as pd
browser=webdriver.Chrome()
browser.get("https://archive.doingbusiness.org/en/scores")
countries = WebDriverWait(browser, 20).until(EC.visibility_of_element_located((By.XPATH, "//*[@id='dftFrontiner']/div[3]/table"))).get_attribute("outerHTML")
df = pd.read_html(countries)
df.to_csv('output'.csv, index=False)
print(df)
time.sleep(2)
browser.quit()
Without the df.to_csv line, I can get the following output:
[ 0 1 2 3
0 Region Region Region Region
1 NaN East Asia & Pacific 62.7 63.3
2 NaN Europe & Central Asia 71.8 73.1
3 NaN Latin America & Caribbean 58.8 59.1
4 NaN Middle East & North Africa 58.4 60.2
.. ... ... ... ...
217 NaN Vietnam 68.6 69.8
218 NaN West Bank and Gaza 59.7 60
219 NaN Yemen, Rep. 30.7 31.8
220 NaN Zambia 65.7 66.9
221 NaN Zimbabwe 50.5 54.5
When I add that line (df.to_csv('output'.csv, index=False)), I cannot save the file. However, I need this data in CSV format. Please show me how to write the code.
Thanks.
That's because pandas.read_html returns a list of DataFrames, so you need to index into that list before saving the .csv. Also note that the file name must sit entirely inside the quotes: 'output.csv', not 'output'.csv.
Replace:
df.to_csv('output.csv', index=False)
With this:
df[0].to_csv('output.csv', index=False)
# Output :
print(df[0])
0 1 2 3
0 Region Region Region Region
1 NaN East Asia & Pacific 62.7 63.3
2 NaN Europe & Central Asia 71.8 73.1
3 NaN Latin America & Caribbean 58.8 59.1
4 NaN Middle East & North Africa 58.4 60.2
.. ... ... ... ...
217 NaN Vietnam 68.6 69.8
218 NaN West Bank and Gaza 59.7 60
219 NaN Yemen, Rep. 30.7 31.8
220 NaN Zambia 65.7 66.9
221 NaN Zimbabwe 50.5 54.5
[222 rows x 4 columns]
If you need a separate .csv for each group (Region & Economy), use this:
for g in df[0].ffill().groupby(0, sort=False):
    sub_df = g[1].reset_index(drop=True).iloc[1:, 1:]
    sub_df.to_csv(f"{g[0]}.csv", index=False)
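As a side note on readability, the same loop can be written with the (name, group) tuple unpacked into explicit names; this is only a restatement of the snippet above, not a change in behaviour:
for name, group in df[0].ffill().groupby(0, sort=False):
    # Drop the repeated header row and the group-label column, then write one file per group
    sub_df = group.reset_index(drop=True).iloc[1:, 1:]
    sub_df.to_csv(f"{name}.csv", index=False)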

Persisting "nan" values from an SQLite database

I am trying my best to drop all columns and rows with nan values, because my code breaks on them.
Before you ask: yes, I asked Google, and I have what should be the correct code block to drop all nan values.
# ____________________________________________________________________________________ SQLite3 Integration
import sqlite3
import pandas as pd

# Read sqlite query results into a pandas DataFrame
con = sqlite3.connect("Sensors Database.db")  # Name of database
df = pd.read_sql_query("SELECT * FROM Hollow_Data_1", con)
# Verify that result of SQL query is stored in the dataframe
con.close()
# print the database table in Console
# drop all nan value
df = df.dropna(subset=['latitude', 'longitude', 'alt'])
print(df)
# df = pd.read_csv('DutchHollow_07222018_non_nan.csv')
The commented-out last line is a test CSV I use to confirm my code works once these nan values are gone.
When I run print(df.head()) I still see the same thing. It's like it never even drops them.
Database snippet:
This database is designed to gather data from drone sensors, and the nan values are junk data (the drone is powering on and programming the flight, not yet in the air). My dashboard plots these points on a Mapbox map.
This could be handled with NOT NULL in the database schema in real time, so the table would not contain these nan values in the first place.
Essentially, I want to drop all rows and columns with nan values.
I have tried copying the dataframe and calling dropna() in several ways: nan_df = df.dropna(), df.dropna(how='all'), df.dropna(subset=['latitude', 'longitude', 'alt']), and df.dropna(inplace=True).
Every time I print(df.head()), the nan values persist. It's like the dropna is not there.
This code will be open source: https://github.com/coxchris859?tab=repositories
Messing around with your data, it appears that your CSV is just a bit scuffed:
instead of true NaNs, your missing values are coming in as the string ' nan' (yes, with a space before it).
Given your df:
gps_date latitude longitude alt ... Temperature Humidity Pressure Voc
0 7/8/2018 20:41 37.7122685 -120.961832 30.4 ... 39.55 1011.68 27.130 1277076.0
1 7/8/2018 20:17 0 0 nan ... 39.66 1014.00 28.967 10943.0
2 7/8/2018 20:17 nan nan nan ... 41.19 1014.02 28.633 15895.0
3 7/8/2018 20:17 nan nan nan ... 42.05 1014.04 27.901 21403.0
4 7/8/2018 20:17 nan nan nan ... 42.49 1014.05 27.169 27909.0
... ... ... ... ... ... ... ... ... ...
4060 7/22/2018 21:50 37.7085305 -121.072975 38.1 ... 42.54 1014.45 22.296 995778.0
4061 7/22/2018 21:50 37.70852517 -121.0729798 38.1 ... 42.53 1014.45 22.305 998589.0
4062 7/22/2018 21:50 37.7085225 -121.0729787 38.2 ... 42.54 1014.44 22.307 999294.0
4063 7/22/2018 21:50 37.70852533 -121.072976 38.4 ... 42.54 1014.45 22.323 1000000.0
4064 7/22/2018 21:50 37.70853217 -121.0729735 38.6 ... 42.54 1014.46 22.323 999294.0
[4065 rows x 21 columns]
Doing:
import numpy as np

df = df.replace(' nan', np.nan)
df = df.dropna()
Output:
gps_date latitude longitude alt ... Temperature Humidity Pressure Voc
0 7/8/2018 20:41 37.7122685 -120.961832 30.4 ... 39.55 1011.68 27.130 1277076.0
101 7/8/2018 20:19 37.72486737 -120.9415272 -179.979 ... 42.33 999.77 22.664 511798.0
103 7/8/2018 20:19 37.7193156 -120.9505354 10.642 ... 42.22 999.79 22.596 521619.0
104 7/8/2018 20:19 37.71908237 -120.9503735 1.043 ... 42.12 999.88 22.523 524717.0
105 7/8/2018 20:19 37.71871426 -120.9502485 -11.66 ... 42.03 999.80 22.539 528246.0
... ... ... ... ... ... ... ... ... ...
4060 7/22/2018 21:50 37.7085305 -121.072975 38.1 ... 42.54 1014.45 22.296 995778.0
4061 7/22/2018 21:50 37.70852517 -121.0729798 38.1 ... 42.53 1014.45 22.305 998589.0
4062 7/22/2018 21:50 37.7085225 -121.0729787 38.2 ... 42.54 1014.44 22.307 999294.0
4063 7/22/2018 21:50 37.70852533 -121.072976 38.4 ... 42.54 1014.45 22.323 1000000.0
4064 7/22/2018 21:50 37.70853217 -121.0729735 38.6 ... 42.54 1014.46 22.323 999294.0
[3939 rows x 21 columns]
Alternatively, you can run something like df.latitude = df.latitude.astype(float) on each scuffed column; float() happily parses the ' nan' strings into real NaN values and leaves you with the correct dtype for those columns.
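If you would rather coerce the problem columns in one pass instead of replacing the literal ' nan' string, pd.to_numeric with errors='coerce' turns anything unparseable into a real NaN. A minimal sketch, assuming the column names shown above:
import pandas as pd

# Coerce the coordinate columns to floats; strings like ' nan' become real NaN values
for col in ['latitude', 'longitude', 'alt']:
    df[col] = pd.to_numeric(df[col], errors='coerce')

# Now dropna actually has something to drop
df = df.dropna(subset=['latitude', 'longitude', 'alt'])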

Data manipulation with time series

My dataframe looks like this:
ID date var1 var2
0 1100289299522 2020-12-01 109.046450 8.0125
1 1100289299522 2020-12-02 104.494946 6.1500
2 1100289299522 2020-12-03 117.011582 5.9375
3 1100289299522 2020-12-04 109.615388 5.4750
4 1100289299522 2020-12-05 142.803438 3.8500
... ... ... ... ...
960045 21380318319578 2021-05-27 7.524261 15.4875
960046 21380318319578 2021-05-28 3.256770 17.3625
960047 21380318319578 2021-05-29 0.561512 18.3250
960048 21380318319578 2021-05-30 1.347629 18.7625
960049 21380318319578 2021-05-31 0.112302 20.0750
Is there a simple way in pandas to have one ID per row and set columns like this:
ID 2020-12-01_var1 2020-12-02_var1 ... 2021-05-31_var1 2020-12-01_var2 2020-12-02_var2 ... 2021-05-31_var2
1100289299522 109.046450 104.494946 ... ___ 8.0125 6.1500 ... ___
Then I can use a dimensionality reduction algorithm (like t-SNE) and maybe classify each time series (and ID).
Do you think this is the correct way to proceed?
Try:
out = df.pivot(index='ID', columns='date', values=['var1', 'var2'])
out.columns = out.columns.to_flat_index().str.join('_')
For your sample:
>>> out
var1_2020-12-01 var1_2020-12-02 var1_2020-12-03 ... var2_2021-05-29 var2_2021-05-30 var2_2021-05-31
ID ...
1100289299522 109.04645 104.494946 117.011582 ... NaN NaN NaN
21380318319578 NaN NaN NaN ... 18.325 18.7625 20.075
[2 rows x 20 columns]
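To then feed this into a dimensionality reduction step, as the question suggests, here is a rough sketch assuming scikit-learn is installed and that filling the missing dates with 0 is acceptable for your data:
from sklearn.manifold import TSNE

# t-SNE cannot handle NaN, so fill the dates an ID has no observation for
X = out.fillna(0).to_numpy()

# Two-dimensional embedding, one point per ID; perplexity must stay below the number of IDs
embedding = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(X)
print(embedding.shape)  # (number_of_ids, 2)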

Using python pandas, how can I merge datasets together and create a column that has the unique modifier? [duplicate]

This question already has answers here:
Convert columns into rows with Pandas
(6 answers)
Closed 1 year ago.
Here is the current dataset that I am working with.
df contains Knn, Kss, and Ktt in three separate columns.
What I have been unable to figure out is how to merge the three into a single column and have a column that has a label.
Here is what I currently have:
df_CohBeh = pd.concat([pd.DataFrame(Knn),
pd.DataFrame(Kss),
pd.DataFrame(Ktt)],
keys=['Knn', 'Kss', 'Ktt'],
ignore_index=True)
Which looks like this:
display(df_CohBeh)
Knn Kss Ktt
0 24.579131 NaN NaN
1 21.673524 NaN NaN
2 25.785409 NaN NaN
3 20.686215 NaN NaN
4 21.504863 NaN NaN
.. ... ... ...
106 NaN NaN 27.615440
107 NaN NaN 27.636029
108 NaN NaN 26.215347
109 NaN NaN 27.626850
110 NaN NaN 25.473380
This is in essence just filtering them, but I would rather have a single value column plus a label column holding the strings "Knn", "Kss", "Ktt", which I can use for plotting the various distributions on the same seaborn graph.
I'm not sure how to create a column that labels each value.
If df looks like that:
>>> df
Knn Kss Ktt
0 96.054660 72.301166 15.355594
1 36.221933 72.646999 41.670382
2 96.503307 78.597493 71.959442
3 53.867432 17.315678 35.006592
4 43.014227 75.122762 83.666844
5 63.301808 72.514763 64.597765
6 0.201688 1.069586 98.816202
7 48.558265 87.660352 9.140665
8 64.353999 43.534200 15.202242
9 41.260903 24.128533 25.963022
10 63.571747 17.474933 47.093538
11 91.006290 90.834753 37.672980
12 61.960163 87.308155 64.698762
13 87.403750 86.402637 78.946980
14 22.238364 88.394919 81.935868
15 56.356764 80.575804 72.925204
16 30.431063 4.466978 32.257898
17 21.403800 46.752591 59.831690
18 57.330671 14.172341 64.764542
19 54.163311 66.037043 0.822948
Try df.melt to merge the three into a single column and get a second column with the label:
variable value
0 Knn 96.054660
1 Knn 36.221933
2 Knn 96.503307
3 Knn 53.867432
4 Knn 43.014227
5 Knn 63.301808
...
20 Kss 72.301166
21 Kss 72.646999
22 Kss 78.597493
23 Kss 17.315678
24 Kss 75.122762
25 Kss 72.514763
...
40 Ktt 15.355594
41 Ktt 41.670382
42 Ktt 71.959442
43 Ktt 35.006592
44 Ktt 83.666844
45 Ktt 64.597765
...
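For completeness, a sketch of the melt call itself and of how the resulting label column can be used with seaborn (the column names 'variable' and 'value' are melt's defaults; this assumes a reasonably recent seaborn):
import seaborn as sns
import matplotlib.pyplot as plt

# Long format: one row per measurement, with the original column name as the label
long_df = df.melt(value_vars=['Knn', 'Kss', 'Ktt'])

# Plot the three distributions on the same axes, coloured by label
sns.kdeplot(data=long_df, x='value', hue='variable')
plt.show()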
You could also build a pandas Series yourself:
import numpy as np

knn = pd.DataFrame({...})
kss = pd.DataFrame({...})
ktt = pd.DataFrame({...})
l = np.concatenate([knn.values.flatten(), kss.values.flatten(), ktt.values.flatten()])
s = pd.Series(l, name="Knn")

Pandas showing KeyError with mean function

Getting KeyError: 'BasePay' for the BasePay column, even though it is there in the DataFrame; it goes missing when using the mean() function.
My pandas version is 0.23.3 on Python 3.6.3.
>>> import pandas as pd
>>> import numpy as np
>>> salDataF = pd.read_csv('Salaries.csv', low_memory=False)
>>> salDataF.head()
Id EmployeeName JobTitle BasePay OvertimePay OtherPay ... TotalPay TotalPayBenefits Year Notes Agency Status
0 1 NATHANIEL FORD GENERAL MANAGER-METROPOLITAN TRANSIT AUTHORITY 167411.18 0.0 400184.25 ... 567595.43 567595.43 2011 NaN San Francisco NaN
1 2 GARY JIMENEZ CAPTAIN III (POLICE DEPARTMENT) 155966.02 245131.88 137811.38 ... 538909.28 538909.28 2011 NaN San Francisco NaN
2 3 ALBERT PARDINI CAPTAIN III (POLICE DEPARTMENT) 212739.13 106088.18 16452.6 ... 335279.91 335279.91 2011 NaN San Francisco NaN
3 4 CHRISTOPHER CHONG WIRE ROPE CABLE MAINTENANCE MECHANIC 77916.0 56120.71 198306.9 ... 332343.61 332343.61 2011 NaN San Francisco NaN
4 5 PATRICK GARDNER DEPUTY CHIEF OF DEPARTMENT,(FIRE DEPARTMENT) 134401.6 9737.0 182234.59 ... 326373.19 326373.19 2011 NaN San Francisco NaN
[5 rows x 13 columns]
>>> salDataF.groupby('Year').mean()
Id TotalPay TotalPayBenefits Notes
Year
2011 18080.0 71744.103871 71744.103871 NaN
2012 54542.5 74113.262265 100553.229232 NaN
2013 91728.5 77611.443142 101440.519714 NaN
2014 129593.0 75463.918140 100250.918884 NaN
>>> EmpSal = salDataF.groupby('Year').mean()['BasePay']
Error: KeyError: 'BasePay'
The problem here is that BasePay is not numeric, so salDataF.groupby('Year').mean() excludes all non-numeric columns by design.
The solution is to first try astype:
salDataF['BasePay'] = salDataF['BasePay'].astype(float)
...and if that fails because of some non-numeric data, use to_numeric with errors='coerce' to convert those values to NaN:
salDataF['BasePay'] = pd.to_numeric(salDataF['BasePay'], errors='coerce')
Then it is better to select the column before calling mean:
EmpSal = salDataF.groupby('Year')['BasePay'].mean()
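Putting it together, a short sketch of the whole flow (assuming the Salaries.csv layout shown above):
import pandas as pd

salDataF = pd.read_csv('Salaries.csv', low_memory=False)

# Force BasePay to numeric; anything unparseable becomes NaN instead of raising
salDataF['BasePay'] = pd.to_numeric(salDataF['BasePay'], errors='coerce')

# Select the column first, then aggregate; mean() skips NaN by default
EmpSal = salDataF.groupby('Year')['BasePay'].mean()
print(EmpSal)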