How to construct a data frame from raw data from a CSV file - pandas

I am currently learning the Python environment to process sensor data.
I have a board with 32 sensors reading temperature. At the following link, you can find an extract of the raw data: https://5e86ea3db5a86.htmlsave.net/
I am trying to construct a data frame grouped by date from my CSV file using pandas (see the potential structure of the table: https://docs.google.com/spreadsheets/d/1zpDI7tp4nSn8-Hm3T_xd4Xz7MV6VDGcWGxwNO-8S0-s/edit?usp=sharing).
So far, I have read the data file into pandas and deleted all the unnamed columns. I am struggling with the creation of a sensor ID column, which should contain the 32 sensor IDs, and a temperature column.
How should I loop through this CSV file to create 3 columns (date, sensor ID and temperature)?
Thanks for the help

It looks like the first item in each line is the date, then there are pairs of sensor ID and value, then a blank value that we can exclude. If so, then the following should work. If not, try to adapt the code to your purposes.
import pandas as pd

data = []
with open('filename.txt', 'r') as f:
    for line in f:
        # the if excludes empty strings
        parts = [part for part in line.split(',') if part]
        # this gets the date in a format that pandas can recognize;
        # you can omit the replace operations if not needed
        sensor_date = parts[0].strip().replace('[', '').replace(']', '')
        # the rest of the list is the pairings of sensor and reading
        sensor_readings = parts[1:]
        # this uses list slicing to iterate over even and odd elements:
        # ::2 means every second item starting at zero, which are the evens;
        # 1::2 means every second item starting at one, which are the odds
        for sensor, reading in zip(sensor_readings[::2], sensor_readings[1::2]):
            data.append({'sensor_date': sensor_date,
                         'sensor': sensor,
                         'reading': reading})

pd.DataFrame(data)
Using your sample data, I got the following:
Output:
sensor_date sensor reading
0 Tue Jul 02 16:35:22.782 2019 28C037080B000089 16.8750
1 Tue Jul 02 16:35:22.782 2019 284846080B000062 17.0000
2 Tue Jul 02 16:35:22.782 2019 28A4BA070B00002B 16.8750
3 Tue Jul 02 16:35:22.782 2019 28D4E3070B0000D5 16.9375
4 Tue Jul 02 16:35:22.782 2019 28A21E080B00002F 17.0000
.. ... ... ...
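If you also want typed columns for plotting or resampling, here is a small follow-up sketch; the format string assumes timestamps exactly like those shown in the output above.
df = pd.DataFrame(data)
# 'Tue Jul 02 16:35:22.782 2019' -> datetime; readings -> float
df['sensor_date'] = pd.to_datetime(df['sensor_date'],
                                   format='%a %b %d %H:%M:%S.%f %Y')
df['reading'] = df['reading'].astype(float)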

Combining multiple dataframe columns into a single time series

I have built a financial model in Python where I can enter sales and profit for x years in y scenarios: a base scenario plus however many I add.
Annual figures are uploaded per scenario in my first dataframe (e.g. if x = 5 beginning in 2022, then the base scenario sales column shows figures for 2022, 2023, 2024, 2025 and 2026).
I then use monthly weightings to create a monthly phased sales forecast in a new dataframe, with columns titled Base sales 2022 (figures shown monthly), Base sales 2023, Base sales 2024, etc.
I want to show these figures in a single series, so that I have one time series for base sales from Jan 2022 to Dec 2026 for charting and analysis purposes.
I've managed to get this to work by creating a list and manually adding the names of each column I want to add, but this will not work if I have a different number of scenarios or years, so I am trying to automate the process and can't find a way to do it.
I don't want to share my main model code, but I have created a mini model doing a similar thing below. It doesn't work: although it generates most of the output I want (three lists are requested: listA0, listA1, listA2), the lists clearly aren't created, as they aren't callable. Also, I really need all the text on a single line rather than split over multiple lines (or perhaps I should use list append for each subsequent item). Any help gratefully received.
Below is the code I have tried:
import pandas as pd

# Create the list of scenarios and capture the number for use later
Scenlist = ["Bad", "Very bad", "Terrible"]
Scen_number = 3
# Create the list of years under assessment and count the number of years
Years = [2020, 2021, 2022]
Totyrs = len(Years)
# Create the dataframe dprofit and, for example purposes, create the columns, all showing two data points (10 and 10)
dprofit = pd.DataFrame()
a = 0
b = 0
# This creates column names in the format "Bad profit 2020", "Bad profit 2021", etc.
while a < Scen_number:
    while b < Totyrs:
        dprofit[Scenlist[a] + " profit " + str(Years[b])] = [10, 10]
        b = b + 1
    b = 0
    a = a + 1
# Now that the columns have been created, print the table
print(dprofit)
# Now create the new dataframe dprofit2, which will be used to capture the three columns (bad, very bad and terrible) for the full time period by listing the years one after another
dprofit2 = pd.DataFrame()
# Create the output to recall the columns from dprofit to combine into 3 lists: listA0, listA1 and listA2
a = 0
b = 0
Totyrs = len(Years)
while a < Scen_number:
    while b < Totyrs:
        if b == 0:
            print(f"listA{a}=dprofit[{Scenlist[a]} profit {Years[b]}]")
        else:
            print(f"+dprofit[{Scenlist[a]} profit {Years[b]}]")
        b = b + 1
    b = 0
    a = a + 1
print(listA0)
# print(listA0) will not run: NameError: name 'listA0' is not defined. Did you mean: 'list'?
To fix the printing, you could set the end param to end='' (the results list also needs to be initialized first).
a = 0
b = 0
results = []  # collect the (scenario, year) pairs as we go
while a < Scen_number:
    while b < Totyrs:
        if b == 0:
            print(f"listA{a}=dprofit[{Scenlist[a]} profit {Years[b]}]", end="")
        else:
            print(f"+dprofit[{Scenlist[a]} profit {Years[b]}]", end="")
        results.append([Scenlist[a], Years[b]])
        b = b + 1
    print()
    b = 0
    a = a + 1
Output:
listA0=dprofit[Bad profit 2020]+dprofit[Bad profit 2021]+dprofit[Bad profit 2022]
listA1=dprofit[Very bad profit 2020]+dprofit[Very bad profit 2021]+dprofit[Very bad profit 2022]
listA2=dprofit[Terrible profit 2020]+dprofit[Terrible profit 2021]+dprofit[Terrible profit 2022]
To obtain a list or pd.DataFrame of the columns, you could simply filter() for the required columns. No loop required.
listA0 = dprofit.filter(regex="Bad profit", axis=1)
listA1 = dprofit.filter(regex="Very bad profit", axis=1)
listA2 = dprofit.filter(regex="Terrible profit", axis=1)
print(listA1)
Output for listA1:
Very bad profit 2020 Very bad profit 2021 Very bad profit 2022
0 10 10 10
1 10 10 10
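To get the single end-to-end series per scenario that the question asks for, you can stack each scenario's filtered columns with pd.concat. A minimal sketch, assuming the dprofit frame and Scenlist built above:
series_by_scenario = {}
for scen in Scenlist:
    # select only this scenario's yearly columns (anchor avoids "Bad" matching "Very bad")
    cols = dprofit.filter(regex=f"^{scen} profit", axis=1)
    # stack the yearly columns end-to-end into one Series
    series_by_scenario[scen] = pd.concat(
        [cols[c] for c in cols.columns], ignore_index=True)
print(series_by_scenario["Bad"])
This avoids generating code strings entirely, and it automatically adapts to however many scenarios and years the model contains.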

Treat Header as Data in Dataframe

I am using a package to read in a table from a PDF. The source table is badly formed, so I have a series of inconsistently formatted tables that I have to clean on the back end (so reading with header=None is not an option). The first row, which is data, is being treated as a header. How can I get that first row treated as a data row so I can add a proper header? (The output below is truncated, as it has numerous columns.)
**Asia Afghanistan 35,939**
0 Asia Bahrain 972
1 Asia Bhutan 1,910
2 Asia Brunei 111
3 Asia Burma 20,078
4 Asia Cambodia 179,662
The goal is for the "Afghanistan" header row to drop to index 0 and then to label the columns Continent, Country, Total.
Thanks in advance; this has driven me nuts.
In response to the request for actual code, see below; the issue is in tables[1].
import pandas as pd
import tabula

file = "https://travel.state.gov/content/dam/visas/Diversity-Visa/DVStatistics/DV-applicant-entrants-by-country-2019-2021.pdf"
tables = tabula.read_pdf(file, pages="all", multiple_tables=True)
tables[1].head()
# note: I tried to use zip, but this only creates a multi-level header,
# not the desired effect of pushing the current header down as data and adding a new header
ColumnNames = ['Region', 'Foreign State of Chargeability', 'FY 2019 Entrants', 'FY 2019 Derivatives', 'FY 2019 Total', 'FY 2020 Entrants', 'FY 2020 Derivatives', 'FY 2020 Total', 'FY 2021 Entrants', 'FY 2021 Derivatives', 'FY 2021 Total']
tables[1].columns = pd.MultiIndex.from_tuples(
    zip(ColumnNames, tables[1].columns))
tables[1].reset_index(0)
tables[1].head()
OK, I got it. Perhaps not the most elegant solution: I created a one-row data frame from the "column labels" (which were actually data), assigned the real column labels to both frames (which overwrites the mis-parsed header on the original), then concatenated the two.
ColumnNames = ['Region', 'Foreign State of Chargeability', 'FY 2019 Entrants', 'FY 2019 Derivatives', 'FY 2019 Total', 'FY 2020 Entrants', 'FY 2020 Derivatives', 'FY 2020 Total', 'FY 2021 Entrants', 'FY 2021 Derivatives', 'FY 2021 Total']
First_Row = tables[1].columns.values.tolist()
# make a one-row dataframe from the mis-parsed header
dTemp = pd.DataFrame(First_Row)
dTemp = dTemp.transpose()
dTemp.columns = ColumnNames
# we can now overwrite the old header with the proper column labels
tables[1].columns = ColumnNames
tables[1] = pd.concat([dTemp, tables[1]], axis=0, ignore_index=True)
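The same idea in a slightly more compact form, as a sketch (not tested against the actual PDF), assuming tables[1] and ColumnNames as above:
# push the mis-parsed header down as the first data row,
# then assign the proper column names in one step
header_as_row = pd.DataFrame([tables[1].columns.tolist()], columns=ColumnNames)
tables[1].columns = ColumnNames
tables[1] = pd.concat([header_as_row, tables[1]], ignore_index=True)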

Loop over pandas dataframe to create multiple networks

I have data on countries' trade with one another. I have split the main file by month, giving 12 CSV files for the year 2019. A sample of the January CSV is provided below:
reporter partner year month trade
0 Albania Argentina 2019 01 515256
1 Albania Australia 2019 01 398336
2 Albania Austria 2019 01 7664503
3 Albania Bahrain 2019 01 400
4 Albania Bangladesh 2019 01 653907
5 Zimbabwe Zambia 2019 01 79569855
I want to make a complex network for every month and print the number of nodes of each network. I can do it the hard (stupid) way, like so:
df01 = pd.read_csv('012019.csv')
df02 = pd.read_csv('022019.csv')
df03 = pd.read_csv('032019.csv')
df1= df01[['reporter','partner', 'trade']]
df2= df02[['reporter','partner', 'trade']]
df3= df03[['reporter','partner', 'trade']]
G1 = nx.Graph()
G1 = nx.from_pandas_edgelist(df1, 'reporter', 'partner', edge_attr='trade')
G1.number_of_nodes()
and so on for the next networks.
My question is: how can I use a for loop to read the files, convert them to networks, and report the number of nodes of each network?
I tried this, but nothing is reported.
for f in glob.glob('.csv'):
    df = pd.read_csv(f)
    df1 = df[['reporter', 'partner', 'trade']]
    G = nx.from_pandas_edgelist(df1, 'reporter', 'partner', edge_attr='trade')
    G.number_of_nodes()
Thanks.
Edit:
OK, so I managed to do the above with code like the following:
for files in glob.glob('/home/user/VMShared/network/2nd/*.csv'):
    df = pd.read_csv(files)
    df1 = df[['reporter', 'partner', 'import']]
    G = nx.Graph()
    G = nx.from_pandas_edgelist(df1, 'reporter', 'partner', edge_attr='import')
    nx.write_graphml_lxml(G, "/home/user/VMShared/network/2nd/*.graphml")
The problem that I now face is how to write separate files: all I get from this is one file, titled *.graphml. How can I get a graphml file for every input file? Also, getting the same output name as the input file would be a plus.
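One way to get a separate, matching output name per input file is to derive the output path from the input path with os.path. A sketch, assuming the same directory layout and columns as above:
import glob
import os

import networkx as nx
import pandas as pd

for path in glob.glob('/home/user/VMShared/network/2nd/*.csv'):
    df = pd.read_csv(path)
    edges = df[['reporter', 'partner', 'import']]
    G = nx.from_pandas_edgelist(edges, 'reporter', 'partner', edge_attr='import')
    # report the node count for this month's network
    print(path, G.number_of_nodes())
    # 012019.csv -> 012019.graphml, written next to the input file
    base = os.path.splitext(os.path.basename(path))[0]
    nx.write_graphml_lxml(G, f"/home/user/VMShared/network/2nd/{base}.graphml")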

How to remove rows in Pandas DataFrame that are partial duplicates?

I have a DataFrame of scraped tweets, and I am trying to remove the rows that are partial duplicates.
Below is a simplified DataFrame with the same issue. Notice how the first and the last tweets have everything but the attached URL ending in common; I need a way to drop partial duplicates like this and keep only the latest instance.
data = {
    'Tweets': [' The Interstate is closed www.txdot.com/closed',
               'The project is complete www.txdot.com/news',
               'The Interstate is closed www.txdot.com/news'],
    'Date': ['Mon Aug 03 20:48:42', 'Mon Aug 03 20:15:42', 'Mon Aug 03 20:01:42']
}
df = pd.DataFrame(data)
I've tried dropping duplicates with the drop_duplicates method below, but there doesn't seem to be an argument that accomplishes this.
df.drop_duplicates(subset=['Tweets'])
Any ideas how to accomplish this?
You can write a regex that identifies each tweet by the main URL portion and ignores everything after the forward slash.
df['Tweets'].replace(r'(www\.\w+\.com)/(\w+)', r'\1', regex=True).drop_duplicates()
Yields
0 The Interstate is closed www.txdot.com
1 The project is complete www.txdot.com
Name: Tweets, dtype: object
We can then pass the resulting index back to df.loc to select the rows to keep.
df.loc[df['Tweets'].replace(r'(www\.\w+\.com)/(\w+)', r'\1', regex=True).drop_duplicates().index]
Tweets Date
0 The Interstate is closed www.txdot.com/closed Mon Aug 03 20:48:42
1 The project is complete www.txdot.com/news Mon Aug 03 20:15:42
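This keeps the latest instance because the sample frame is already ordered newest-first and drop_duplicates defaults to keep='first'. If your frame is not ordered that way, you could make "keep the latest" explicit by sorting first. A sketch, assuming Date strings like those in the sample (same day, so plain string sorting works):
key = df['Tweets'].replace(r'(www\.\w+\.com)/(\w+)', r'\1', regex=True)
latest = (df.assign(_key=key)
            .sort_values('Date', ascending=False)   # newest first
            .drop_duplicates(subset='_key', keep='first')
            .drop(columns='_key'))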

Mapping column values to a combination of another csv file's information

I have a dataset that indicates date & time in a 5-digit format: ddd + hm.
The ddd part starts from 2009 Jan 1. Since the data was collected over a two-year period, its [min, max] is [1, 365 x 2 = 730].
Data is observed at 30-minute intervals, so each 24-hour day contains up to 48 observations, giving hm a [min, max] of [1, 48].
The daycode.csv file contains the ddd part of the daycode with its matching date, and the hm part of the daycode with its matching time.
I think I agreed not to show the dataset, which is from ISSDA, so I will just describe it: a daycode in the File1.txt file reads like '63317'.
This link gave me a glimpse of how to approach this problem, and I was in the middle of putting this code together, which of course won't work at this point.
import pandas as pd
import matplotlib.pyplot as plt

consume = pd.read_csv("data/File1.txt", sep=' ', encoding="utf-8", names=['meter', 'daycode', 'val'])
df1 = pd.read_csv("data/daycode.csv", encoding="cp1252", names=['code', 'print'])
test = consume[consume['meter'] == 1048]
test['daycode'] = test['daycode'].map(df1.set_index('code')['print'])
plt.plot(test['daycode'], test['val'], '.')
plt.title('test of meter 1048')
plt.xlabel('daycode')
plt.ylabel('energy consumption [kWh]')
plt.show()
Not all units (there are thousands) have been observed for the full length, but 730 x 48 is too large a combination to lay out in Excel by hand. To be honest, I tried dragging, which is not an elegant solution, and it doesn't quite get there.
If I could read the first 3 digits of the column values and match them with another file's column, and the last 2 digits with another column, then combine them... is there a way?
For splitting off the first 3 and last 2 digits, you can just do something like this:
df['first_3_digits'] = df['col1'].map(lambda x: str(x)[:3])
df['last_2_digits'] = df['col1'].map(lambda x: str(x)[-2:])
For joining the two dataframes:
df3 = df.merge(df2, left_on=['first_3_digits', 'last_2_digits'],
               right_on=['col1_df2', 'col2_df2'], how='left')
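Here is a miniature worked example putting the pieces together. The frames are hypothetical stand-ins for File1.txt and daycode.csv (zero-padded 5-digit daycodes assumed, and the dates and times are made up for illustration):
import pandas as pd

consume = pd.DataFrame({'meter': [1048, 1048],
                        'daycode': ['63317', '63318'],
                        'val': [0.5, 0.7]})
daycode = pd.DataFrame({'ddd': ['633', '633'],
                        'hm': ['17', '18'],
                        'date': ['2010-09-25', '2010-09-25'],
                        'time': ['08:00', '08:30']})

# split the 5-digit code into its day (ddd) and half-hour (hm) parts
consume['ddd'] = consume['daycode'].str[:3]
consume['hm'] = consume['daycode'].str[-2:]
merged = consume.merge(daycode, on=['ddd', 'hm'], how='left')
print(merged)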