How to remove rows in Pandas DataFrame that are partial duplicates? - pandas

I have a DataFrame of scraped tweets, and I am trying to remove the rows of tweets that are partial duplicates.
Below is a simplified DataFrame with the same issue. Notice how the first and the last tweet have all but the attached url ending in common; I need a way to drop partial duplicates like this and only keep the latest instance.
import pandas as pd
data = {
    'Tweets': ['The Interstate is closed www.txdot.com/closed',
               'The project is complete www.txdot.com/news',
               'The Interstate is closed www.txdot.com/news'],
    'Date': ['Mon Aug 03 20:48:42', 'Mon Aug 03 20:15:42', 'Mon Aug 03 20:01:42']
}
df = pd.DataFrame(data)
I've tried dropping duplicates with the drop_duplicates method below, but there doesn't seem to be an argument to accomplish this.
df.drop_duplicates(subset=['Tweets'])
Any ideas how to accomplish this?

You can use a regex to strip everything after the forward slash, so that each tweet is identified by its text plus the main URL portion only, and then drop the duplicates.
df['Tweets'].replace(r'(www\.\w+\.com)/(\w+)', r'\1', regex=True).drop_duplicates()
Yields
0 The Interstate is closed www.txdot.com
1 The project is complete www.txdot.com
Name: Tweets, dtype: object
We can then pass the index of the surviving rows to df.loc to select them from the original DataFrame.
df.loc[df['Tweets'].replace(r'(www\.\w+\.com)/(\w+)', r'\1', regex=True).drop_duplicates().index]
Tweets Date
0 The Interstate is closed www.txdot.com/closed Mon Aug 03 20:48:42
1 The project is complete www.txdot.com/news Mon Aug 03 20:15:42
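The approach above keeps the first occurrence in row order. If "latest" should be decided by the Date column instead, a minimal sketch might look like the following. It assumes the weekday prefix can be dropped before parsing (the sample dates carry no year, so pandas fills in a default one), and the helper column names `_key` and `_when` are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "Tweets": [
        "The Interstate is closed www.txdot.com/closed",
        "The project is complete www.txdot.com/news",
        "The Interstate is closed www.txdot.com/news",
    ],
    "Date": ["Mon Aug 03 20:48:42", "Mon Aug 03 20:15:42", "Mon Aug 03 20:01:42"],
})

# Comparison key: tweet text with the URL path removed and outer whitespace stripped.
key = df["Tweets"].str.replace(r"(www\.\w+\.com)/\S*", r"\1", regex=True).str.strip()

# Drop the weekday name ("Mon ") and parse the remainder; only relative order matters here.
when = pd.to_datetime(df["Date"].str[4:], format="%b %d %H:%M:%S")

latest = (
    df.assign(_key=key, _when=when)
      .sort_values("_when", ascending=False)          # newest first
      .drop_duplicates(subset="_key", keep="first")   # keep the newest row per key
      .drop(columns=["_key", "_when"])
      .sort_index()                                    # restore the original row order
)
print(latest)
```

Sorting before drop_duplicates means the result no longer depends on how the scraped rows happen to be ordered.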

Related

Combining multiple dataframe columns into a single time series

I have built a financial model in python where I can enter sales and profit for x years in y scenarios - a base scenario plus however many I add.
Annual figures are uploaded per scenario in my first dataframe (e.g. if x = 5 beginning in 2022 then the base scenario sales column would show figures for 2022, 2023, 2024, 2025 and 2026)
I then use monthly weightings to create a monthly phased sales forecast in a new dataframe with the title Base sales 2022 and figures shown monthly, base sales 2023, base sales 2024 etc
I want to show these figures in a single series, so that I have a single time series for base sales from Jan 2022 to Dec 2026 for charting and analysis purposes.
I've managed to get this to work by creating a list and manually adding the name of each column I want to include, but this will not work if I have a different number of scenarios or years. I am trying to automate the process but can't find a way to do it.
I don't want to share my main model code, but I have created a mini model below that does a similar thing. It doesn't work: although it prints most of the output I want (three lists are requested: listA0, listA1, listA2), the lists clearly aren't created, as referencing them raises a NameError. Also, I really need all the text on a single line rather than split over multiple lines (or perhaps I should use list append for each subsequent item). Any help gratefully received.
Below is the code I have tried:
import pandas as pd
#Create list of scenarios and capture the number for use later
Scenlist=["Bad","Very bad","Terrible"]
Scen_number=3
#Create the list of years under assessment and count the number of years
Years=[2020,2021,2022]
Totyrs=len(Years)
#Create the dataframe dprofit and for example purposes create the columns, all showing two datapoints 10 and 10
dprofit=pd.DataFrame()
a=0
b=0
#This creates column names in the format Bad profit 2020, Bad profit 2021 etc
while a<Scen_number:
    while b<Totyrs:
        dprofit[Scenlist[a]+" profit "+str(Years[b])]=[10,10]
        b=b+1
    b=0
    a=a+1
#Now that the columns have been created print the table
print(dprofit)
#Now create the new table dprofit2 which will be used to capture the three columns (bad, very bad and terrible) for the full time period by listing the years one after another
dprofit2=pd.DataFrame()
#Create the output to recall the columns from dprofit to combine into 3 lists listA0, listA1 and listA2
a=0
b=0
Totyrs=len(Years)
while a<Scen_number:
    while b<Totyrs:
        if b==0:
            print(f"listA{a}=dprofit[{Scenlist[a]} profit {Years[b]}]")
        else:
            print(f"+dprofit[{Scenlist[a]} profit {Years[b]}]")
        b=b+1
    b=0
    a=a+1
print(listA0)
#print(listA0) fails with NameError: name 'listA0' is not defined. Did you mean: 'list'?
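To print each statement on a single line, you could set print's end parameter to end="". Note that this only prints the text of each statement; it does not create the variables listA0, listA1 and listA2.
results = []
a = 0
b = 0
while a < Scen_number:
    while b < Totyrs:
        if b == 0:
            print(f"listA{a}=dprofit[{Scenlist[a]} profit {Years[b]}]", end="")
        else:
            print(f"+dprofit[{Scenlist[a]} profit {Years[b]}]", end="")
        results.append([Scenlist[a], Years[b]])
        b = b + 1
    print()
    b = 0
    a = a + 1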
Output:
listA0=dprofit[Bad profit 2020]+dprofit[Bad profit 2021]+dprofit[Bad profit 2022]
listA1=dprofit[Very bad profit 2020]+dprofit[Very bad profit 2021]+dprofit[Very bad profit 2022]
listA2=dprofit[Terrible profit 2020]+dprofit[Terrible profit 2021]+dprofit[Terrible profit 2022]
To obtain a list or pd.DataFrame of the columns, you could simply filter() for the required columns. No loop required.
listA0 = dprofit.filter(regex="Bad profit", axis=1)
listA1 = dprofit.filter(regex="Very bad profit", axis=1)
listA2 = dprofit.filter(regex="Terrible profit", axis=1)
print(listA1)
Output for listA1:
Very bad profit 2020 Very bad profit 2021 Very bad profit 2022
0 10 10 10
1 10 10 10
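If the goal is one continuous series per scenario, with the years stacked one after another, something along these lines might work. This is only a sketch that rebuilds the mini model from the question; the names `scenarios`, `years` and `combined` are chosen for illustration.

```python
import pandas as pd

scenarios = ["Bad", "Very bad", "Terrible"]
years = [2020, 2021, 2022]

# Rebuild the example frame: one column per scenario/year, two datapoints each.
dprofit = pd.DataFrame(
    {f"{s} profit {y}": [10, 10] for s in scenarios for y in years}
)

# For each scenario, stack its yearly columns end to end into a single series.
combined = pd.DataFrame({
    s: pd.concat([dprofit[f"{s} profit {y}"] for y in years], ignore_index=True)
    for s in scenarios
})
print(combined)
```

Because the column names are built from the two lists, adding a scenario or a year only means extending `scenarios` or `years`.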

Treat Header to Data in Dataframe

I am using a package to read in a table from a pdf. The source table is badly formed so I have a series of inconsistently formatted tables I have to clean on the backend (so reading with header "none" is not an option). The first row, which is data, is being treated as a header. How can I get that first row treated as a data row so I can add a proper header? (output below truncated as it has numerous columns)
**Asia Afghanistan 35,939**
0 Asia Bahrain 972
1 Asia Bhutan 1,910
2 Asia Brunei 111
3 Asia Burma 20,078
4 Asia Cambodia 179,662
The goal is for the "Afghanistan" header row to drop to index 0, and then to label the columns Continent, Country, Total.
Thanks in advance, this has driven me nuts
Note: in response to the request for the actual code, see below; the issue is in tables[1].
import pandas as pd
import tabula
file = "https://travel.state.gov/content/dam/visas/Diversity-Visa/DVStatistics/DV-applicant-entrants-by-country-2019-2021.pdf"
tables = tabula.read_pdf(file, pages = "all", multiple_tables = True)
tables[1].head()
# note: I tried to use zip, but this only creates a multilevel header, not the desired effect of pushing the current header down as data and adding a new header
ColumnNames =['Region','Foreign State of Chargeability','FY 2019 Entrants','FY 2019 Derivatives','FY 2019 Total','FY 2020 Entrants','FY 2020 Derivatives','FY 2020 Total','FY 2021 Entrants','FY 2021 Derivatives','FY 2021 Total']
tables[1].columns = pd.MultiIndex.from_tuples(
    zip(ColumnNames,
        tables[1].columns))
tables[1].reset_index(0)
tables[1].head()
OK, I got it. Perhaps not the most elegant solution: I created a one-row data frame from the current column labels (which are actually data), gave it the real column labels, assigned those same labels to the original frame (which overwrites the mislabeled header), and then concatenated the two.
ColumnNames =['Region','Foreign State of Chargeability','FY 2019 Entrants','FY 2019 Derivatives','FY 2019 Total','FY 2020 Entrants','FY 2020 Derivatives','FY 2020 Total','FY 2021 Entrants','FY 2021 Derivatives','FY 2021 Total']
First_Row = tables[1].columns.values.tolist()
# make a one-row dataframe from the old header (transpose turns the single column into a single row)
dTemp= pd.DataFrame(First_Row)
dTemp=dTemp.transpose()
dTemp.columns=ColumnNames
# we can now overwrite the old header with the real column labels
tables[1].columns=ColumnNames
tables[1] = pd.concat([dTemp,tables[1]], axis=0,ignore_index=True)
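The same idea can be wrapped in a small helper so it can be reused across the other badly formed tables. This is a sketch on a toy frame built from the truncated output in the question rather than on the real PDF, and `demote_header` is a made-up name.

```python
import pandas as pd

def demote_header(df, new_columns):
    """Return a new frame where df's current header becomes row 0
    and new_columns becomes the header."""
    # One-row frame holding the old header values under the new labels.
    header_row = pd.DataFrame([list(df.columns)], columns=new_columns)
    body = df.copy()
    body.columns = new_columns
    return pd.concat([header_row, body], ignore_index=True)

# Toy stand-in for tables[1]: the first data row was read as the header.
raw = pd.DataFrame(
    [["Asia", "Bahrain", "972"], ["Asia", "Bhutan", "1,910"]],
    columns=["Asia", "Afghanistan", "35,939"],
)
clean = demote_header(raw, ["Continent", "Country", "Total"])
print(clean)
```

Values recovered from the header stay as strings, so numeric columns would still need converting afterwards.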

Python pandas - change last hour of a day 00 with 24

I'm currently working on setting up a pandas dataframe. I have a datetime column, which should be split into date and time columns.
I have managed to split this column into date and hour columns like this:
Date Hour
22 20200205 23
23 20200205 00
What I want is to replace the 00 value with 24. I know Python does not recognize a 24th hour as the last hour, but for my purposes, I need to include the "24" hour.
I'm pretty new to python and pandas and I'm confused about replacing this value.
I tried to use below code line but with no luck:
frame= OMPdata
OMPdata['Hour'] = OMPdata['Hour'].str.replace(00,24, case= False)
or function:
def h24(row):
    if row['Hour'] == "00":
        return "24"
    else:
        return row['Hour']
Please help me with this issue.
Thanks a lot!
Try converting the column to string first, and then replace:
df['Hour'] = df['Hour'].astype(str)
df['Hour'] = df['Hour'].str.replace("00","24", case= False)
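Since str.replace works on substrings, an exact-value replacement may be a safer variant. The sketch below assumes the hours may arrive either as strings or as integers, which is why it zero-pads first; the sample frame simply mirrors the two rows shown in the question.

```python
import pandas as pd

OMPdata = pd.DataFrame({"Date": ["20200205", "20200205"], "Hour": ["23", "00"]})

# Normalise to two-character strings so that 0, "0" and "00" all become "00".
hours = OMPdata["Hour"].astype(str).str.zfill(2)

# Series.replace with a dict swaps whole values only, never parts of a value.
OMPdata["Hour"] = hours.replace({"00": "24"})
print(OMPdata)
```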

How to construct a data frame from raw data from a CSV file

I am currently learning the python environment to process sensor data.
I have a board with 32 sensors reading temperature. At the following link, you can find an extract of the raw data: https://5e86ea3db5a86.htmlsave.net/
I am trying to construct a data frame grouped by date from my CSV file using pandas (see the potential structure of the table: https://docs.google.com/spreadsheets/d/1zpDI7tp4nSn8-Hm3T_xd4Xz7MV6VDGcWGxwNO-8S0-s/edit?usp=sharing).
So far, I have read the data file into pandas and deleted all the unnamed columns. I am struggling with the creation of a sensor ID column, which should contain the 32 sensor IDs, and a temperature column.
How should I loop through this CSV file to create 3 columns (date, sensor ID and temperature)?
Thanks for the help
It looks like the first item in each line is the date, then there are pairs of sensor id and value, then a blank value that we can exclude. If so, then the following should work. If not, try to modify the code to your purposes.
import pandas as pd
data = []
with open('filename.txt', 'r') as f:
    for line in f:
        # strip whitespace (including the trailing newline) and exclude empty strings
        parts = [part.strip() for part in line.split(',') if part.strip()]
        # skip blank lines
        if not parts:
            continue
        # this gets the date in a format that pandas can recognize
        # you can omit the replace operations if not needed
        sensor_date = parts[0].replace('[', '').replace(']', '')
        # the rest of the list are the pairings of sensor and reading
        sensor_readings = parts[1:]
        # this uses list slicing to iterate over even and odd elements in list
        # ::2 means every second item starting with zero, which are evens
        # 1::2 means every second item starting with one, which are odds
        for sensor, reading in zip(sensor_readings[::2], sensor_readings[1::2]):
            data.append({'sensor_date': sensor_date,
                         'sensor': sensor,
                         'reading': reading})
pd.DataFrame(data)
Using your sample data, I got the following:
=== Output: ===
Out[64]:
sensor_date sensor reading
0 Tue Jul 02 16:35:22.782 2019 28C037080B000089 16.8750
1 Tue Jul 02 16:35:22.782 2019 284846080B000062 17.0000
2 Tue Jul 02 16:35:22.782 2019 28A4BA070B00002B 16.8750
3 Tue Jul 02 16:35:22.782 2019 28D4E3070B0000D5 16.9375
4 Tue Jul 02 16:35:22.782 2019 28A21E080B00002F 17.0000
.. ... ... ...
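The values in that frame are still strings. If the aim is a table indexed by date with one column per sensor, a possible follow-up is sketched below. It uses two rows copied from the output above, and the date format string is an assumption based on how those dates are printed.

```python
import pandas as pd

# Two rows taken from the output above, in the same long layout.
long_df = pd.DataFrame({
    "sensor_date": ["Tue Jul 02 16:35:22.782 2019"] * 2,
    "sensor": ["28C037080B000089", "284846080B000062"],
    "reading": ["16.8750", "17.0000"],
})

# Convert the text columns to proper datetime and float types.
long_df["sensor_date"] = pd.to_datetime(
    long_df["sensor_date"], format="%a %b %d %H:%M:%S.%f %Y"
)
long_df["reading"] = pd.to_numeric(long_df["reading"])

# One row per timestamp, one column per sensor ID.
wide = long_df.pivot(index="sensor_date", columns="sensor", values="reading")
print(wide)
```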

Twitter api python - max number of objects

so I am calling the twitter api:
openurl = urllib.urlopen("https://api.twitter.com/1/statuses/user_timeline.json?include_entities=true&contributor_details&include_rts=true&screen_name="+user+"&count=3600")
and it returns some long file like:
[{"entities":{"hashtags":[],"user_mentions":[],"urls":[{"url":"http:\/\/t.co\/Hd1ubDVX","indices":[115,135],"display_url":"amzn.to\/tPSKgf","expanded_url":"http:\/\/amzn.to\/tPSKgf"}]},"coordinates":null,"truncated":false,"place":null,"geo":null,"in_reply_to_user_id":null,"retweet_count":2,"favorited":false,"in_reply_to_status_id_str":null,"user":{"contributors_enabled":false,"lang":"en","profile_background_image_url_https":"https:\/\/si0.twimg.com\/profile_background_images\/151701304\/theme14.gif","favourites_count":0,"profile_text_color":"333333","protected":false,"location":"North America","is_translator":false,"profile_background_image_url":"http:\/\/a0.twimg.com\/profile_background_images\/151701304\/theme14.gif","profile_image_url_https":"https:\/\/si0.twimg.com\/profile_images\/1642783876\/idB005XNC8Z4_normal.png","name":"User Interface Books","profile_link_color":"009999","url":"http:\/\/twitter.com\/ReleasedBooks\/genres","utc_offset":-28800,"description":"All new user interface and graphic design book releases posted on their publication day","listed_count":11,"profile_background_color":"131516","statuses_count":1189,"following":false,"profile_background_tile":true,"followers_count":732,"profile_image_url":"http:\/\/a2.twimg.com\/profile_images\/1642783876\/idB005XNC8Z4_normal.png","default_profile":false,"geo_enabled":false,"created_at":"Mon Sep 20 21:28:15 +0000 2010","profile_sidebar_fill_color":"efefef","show_all_inline_media":false,"follow_request_sent":false,"notifications":false,"friends_count":1,"profile_sidebar_border_color":"eeeeee","screen_name":"User","id_str":"193056806","verified":false,"id":193056806,"default_profile_image":false,"profile_use_background_image":true,"time_zone":"Pacific Time (US & Canada)"},"possibly_sensitive":false,"in_reply_to_screen_name":null,"created_at":"Thu Nov 17 00:01:45 +0000 2011","in_reply_to_user_id_str":null,"retweeted":false,"source":"\u003Ca href=\"http:\/\/twitter.com\/ReleasedBooks\/genres\" 
rel=\"nofollow\"\u003EBook Releases\u003C\/a\u003E","id_str":"136957158075011072","in_reply_to_status_id":null,"id":136957158075011072,"contributors":null,"text":"Digital Media: Technological and Social Challenges of the Interactive World - by William Aspray - Scarecrow Press. http:\/\/t.co\/Hd1ubDVX"},{"entities":{"hashtags":[],"user_mentions":[],"urls":[{"url":"http:\/\/t.co\/GMCzTija","indices":[119,139],"display_u
Well, the different objects are split into lists and dictionaries, and I want to extract the different parts, but to do this I have to know how many objects the file has:
example:
[{1:info , 2:info}][{1:info , 2:info}][{1:info , 2:info}][{1:info , 2:info}]
so to extract the info from 1 in the first table I would:
[0]['1']
>>>>info
But to extract it from the last object in the table I need to know how many objects the table has.
This is what my code looks like:
table_timeline = json.loads(twitter_timeline)
table_timeline_inner = table_timeline[x]
lines = 0
while lines < linesmax:
    in_reply_to_user_id = table_timeline_inner['in_reply_to_status_id_str']
    lines += 1
So how do I find the value of the last object in this table?
thanks
I'm not entirely sure this is what you're looking for, but to get the last item in a Python list, use an index of -1. For example,
>>> alist = [{'position': 'first'}, {'position': 'second'}, {'position': 'third'}]
>>> print(alist[-1])
{'position': 'third'}
>>> print(alist[-1]['position'])
third
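Applied to the parsed timeline, the count and the last object are available directly from the list, and a for loop avoids tracking a line counter at all. The payload below is a heavily trimmed, made-up stand-in for the real response; only the key name `in_reply_to_status_id_str` comes from the question.

```python
import json

# Hypothetical, trimmed stand-in for the JSON text returned by the API.
twitter_timeline = (
    '[{"id_str": "1", "in_reply_to_status_id_str": null},'
    ' {"id_str": "2", "in_reply_to_status_id_str": "1"}]'
)

table_timeline = json.loads(twitter_timeline)  # a list of dicts

count = len(table_timeline)   # number of objects in the list
last = table_timeline[-1]     # last object, no count needed

# Iterate over every object instead of indexing with a manual counter.
reply_ids = [status["in_reply_to_status_id_str"] for status in table_timeline]
print(count, last["id_str"], reply_ids)
```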