Treat Header as Data in DataFrame - pandas

I am using a package to read in a table from a PDF. The source table is badly formed, so I have a series of inconsistently formatted tables to clean on the back end (so reading with header=None is not an option). The first row, which is data, is being treated as a header. How can I get that first row treated as a data row so I can add a proper header? (Output below is truncated, as the table has numerous columns.)
**Asia Afghanistan 35,939**
0 Asia Bahrain 972
1 Asia Bhutan 1,910
2 Asia Brunei 111
3 Asia Burma 20,078
4 Asia Cambodia 179,662
The goal is for the "Afghanistan" header row to drop down to index 0 and then to label the columns Continent, Country, Total.
Thanks in advance; this has driven me nuts.
Note: in response to a request for the actual code, see below; the issue is in tables[1].
import pandas as pd
import tabula

file = "https://travel.state.gov/content/dam/visas/Diversity-Visa/DVStatistics/DV-applicant-entrants-by-country-2019-2021.pdf"
tables = tabula.read_pdf(file, pages="all", multiple_tables=True)
tables[1].head()
# Note: I tried zip, but this only creates a multi-level header, not the desired effect of pushing the current header down as data and adding a new header
ColumnNames = ['Region', 'Foreign State of Chargeability', 'FY 2019 Entrants', 'FY 2019 Derivatives', 'FY 2019 Total', 'FY 2020 Entrants', 'FY 2020 Derivatives', 'FY 2020 Total', 'FY 2021 Entrants', 'FY 2021 Derivatives', 'FY 2021 Total']
tables[1].columns = pd.MultiIndex.from_tuples(zip(ColumnNames, tables[1].columns))
tables[1].reset_index(0)
tables[1].head()

OK, I got it. Perhaps not the most elegant solution: I created a one-row DataFrame from the "column labels" (which were actually data), gave it the real column labels, assigned those labels to the original frame as well (which overwrites the bogus header), then concatenated.
ColumnNames = ['Region', 'Foreign State of Chargeability', 'FY 2019 Entrants', 'FY 2019 Derivatives', 'FY 2019 Total', 'FY 2020 Entrants', 'FY 2020 Derivatives', 'FY 2020 Total', 'FY 2021 Entrants', 'FY 2021 Derivatives', 'FY 2021 Total']
First_Row = tables[1].columns.values.tolist()
# make a one-row dataframe from the misplaced header
dTemp = pd.DataFrame(First_Row)
dTemp = dTemp.transpose()
dTemp.columns = ColumnNames
# we can now overwrite the bogus header with the real column labels
tables[1].columns = ColumnNames
tables[1] = pd.concat([dTemp, tables[1]], axis=0, ignore_index=True)
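For what it's worth, the same idea fits in three lines (a sketch assuming tables[1] and ColumnNames as above; header_row is a hypothetical name):
# push the misread header down as row 0, then assign the real labels
header_row = pd.DataFrame([tables[1].columns.tolist()], columns=ColumnNames)
tables[1].columns = ColumnNames
tables[1] = pd.concat([header_row, tables[1]], ignore_index=True)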

Related

Import/Insert Excel Range and SSIS variables into SQL table?

I have an SSIS package that is to ingest a number of Excel files with similar structures but irregular names and import them into a SQL table. Along with the data from the Excel files, I have a number of variables that are set and differ with each file (User::ExcelFileName, User::VarMonth, User::VarProgram, User::VarYear, etc.). All of the table data from the Excel files goes to the same destination table, but for each row of data I want to insert a column for each variable to pass through into SQL alongside the Excel dataset. An example of my dataset is below:
Excel:
ID   Name    Foo   Bar
111  Bob     88yu  117
112  Jim     JKL   A TU
113  George  FTD   19900
SSIS Variables (set during execution)
User::ExcelFileName = c:\temp\excelfile1.xlsx
User::VarMonth = Jan
User::VarProgram = Daily
User::VarYear = 2023
Desired SQL Destination:
ExcelFileName            VarMonth  VarProgram  VarYear  ID   Name    Foo   Bar
c:\temp\excelfile1.xlsx  Jan       Daily       2023     111  Bob     88yu  117
c:\temp\excelfile1.xlsx  Jan       Daily       2023     112  Jim     JKL   A TU
c:\temp\excelfile1.xlsx  Jan       Daily       2023     113  George  FTD   19900
I've tried a few configurations and I've referenced this post for piping variable data into SQL, but I haven't gotten a working model yet.
Worth noting: the Excel connection is dynamic and set to run within a Foreach Loop container to iterate through my Excel sources. Any advice or guidance would be appreciated!
It sounds like you want a Derived Column transformation.
In the transformation, just add the new columns you want and map the variables to them.
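For illustration, the mapping inside the Derived Column transformation would look roughly like this (the @[User::...] expression syntax is standard SSIS; the derived column names below simply mirror the desired destination columns):

Derived Column Name   Expression
ExcelFileName         @[User::ExcelFileName]
VarMonth              @[User::VarMonth]
VarProgram            @[User::VarProgram]
VarYear               @[User::VarYear]

Because the transformation runs inside the data flow, each value is repeated on every row that passes through, which produces exactly the fan-out shown in the desired destination table.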

Combining multiple dataframe columns into a single time series

I have built a financial model in Python where I can enter sales and profit for x years in y scenarios: a base scenario plus however many I add.
Annual figures are uploaded per scenario in my first dataframe (e.g. if x = 5 beginning in 2022, then the base scenario sales column would show figures for 2022, 2023, 2024, 2025 and 2026).
I then use monthly weightings to create a monthly phased sales forecast in a new dataframe, with columns titled Base sales 2022 (figures shown monthly), Base sales 2023, Base sales 2024, etc.
I want to show these figures in a single series, so that I have a single time series for base sales from Jan 2022 to Dec 2026 for charting and analysis purposes.
I've managed to get this to work by creating a list and manually adding the names of each column I want to add, but this won't work if I have a different number of scenarios or years, so I am trying to automate the process and can't find a way to do it.
I don't want to share my main model's code, but I have created a mini model below that does a similar thing. It doesn't work: although it generates most of the output I want (three lists are requested: listA0, listA1, listA2), the lists clearly aren't created, as they aren't callable. Also, I really need all the text on a single line rather than split over multiple lines (or perhaps I should use list append for each subsequent item). Any help gratefully received.
Below is the code I have tried:
import pandas as pd

# Create the list of scenarios and capture the number for use later
Scenlist = ["Bad", "Very bad", "Terrible"]
Scen_number = 3
# Create the list of years under assessment and count the number of years
Years = [2020, 2021, 2022]
Totyrs = len(Years)
# Create the dataframe dprofit and, for example purposes, create the columns, all showing two datapoints 10 and 10
dprofit = pd.DataFrame()
a = 0
b = 0
# This creates column names in the format Bad profit 2020, Bad profit 2021, etc.
while a < Scen_number:
    while b < Totyrs:
        dprofit[Scenlist[a] + " profit " + str(Years[b])] = [10, 10]
        b = b + 1
    b = 0
    a = a + 1
# Now that the columns have been created, print the table
print(dprofit)
# Now create the new table dprofit2, which will be used to capture the three columns (bad, very bad and terrible) for the full time period by listing the years one after another
dprofit2 = pd.DataFrame()
# Create the output to recall the columns from dprofit to combine into 3 lists: listA0, listA1 and listA2
a = 0
b = 0
Totyrs = len(Years)
while a < Scen_number:
    while b < Totyrs:
        if b == 0:
            print(f"listA{a}=dprofit[{Scenlist[a]} profit {Years[b]}]")
        else:
            print(f"+dprofit[{Scenlist[a]} profit {Years[b]}]")
        b = b + 1
    b = 0
    a = a + 1
print(listA0)
# print(listA0) will not run: NameError: name 'listA0' is not defined. Did you mean: 'list'?
To fix the printing, you could set the end parameter to end=''.
results = []  # collect the (scenario, year) pairs as we go
a = 0
b = 0
while a < Scen_number:
    while b < Totyrs:
        if b == 0:
            print(f"listA{a}=dprofit[{Scenlist[a]} profit {Years[b]}]", end="")
        else:
            print(f"+dprofit[{Scenlist[a]} profit {Years[b]}]", end="")
        results.append([Scenlist[a], Years[b]])
        b = b + 1
    print()
    b = 0
    a = a + 1
Output:
listA0=dprofit[Bad profit 2020]+dprofit[Bad profit 2021]+dprofit[Bad profit 2022]
listA1=dprofit[Very bad profit 2020]+dprofit[Very bad profit 2021]+dprofit[Very bad profit 2022]
listA2=dprofit[Terrible profit 2020]+dprofit[Terrible profit 2021]+dprofit[Terrible profit 2022]
To obtain a list or pd.DataFrame of the columns, you could simply filter() for the required columns. No loop required.
listA0 = dprofit.filter(regex="Bad profit", axis=1)
listA1 = dprofit.filter(regex="Very bad profit", axis=1)
listA2 = dprofit.filter(regex="Terrible profit", axis=1)
print(listA1)
Output for listA1:
Very bad profit 2020 Very bad profit 2021 Very bad profit 2022
0 10 10 10
1 10 10 10
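To get all the way to the original goal (one continuous series per scenario rather than printed code text), here is a sketch along the same lines, assuming the column-naming scheme above; series_by_scenario is a hypothetical name.
# stack each scenario's yearly columns into one continuous series
series_by_scenario = {
    scen: pd.concat([dprofit[f"{scen} profit {yr}"] for yr in Years],
                    ignore_index=True)
    for scen in Scenlist
}
print(series_by_scenario["Very bad"])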

Loop over pandas dataframe to create multiple networks

I have data on countries' trade with one another. I have split the main file by month, giving 12 CSV files for the year 2019. A sample of the January CSV is provided below:
reporter partner year month trade
0 Albania Argentina 2019 01 515256
1 Albania Australia 2019 01 398336
2 Albania Austria 2019 01 7664503
3 Albania Bahrain 2019 01 400
4 Albania Bangladesh 2019 01 653907
5 Zimbabwe Zambia 2019 01 79569855
I want to make a complex network for every month and print the number of nodes of each network. Now I can do it the hard (stupid) way, like so.
import pandas as pd
import networkx as nx

df01 = pd.read_csv('012019.csv')
df02 = pd.read_csv('022019.csv')
df03 = pd.read_csv('032019.csv')
df1 = df01[['reporter', 'partner', 'trade']]
df2 = df02[['reporter', 'partner', 'trade']]
df3 = df03[['reporter', 'partner', 'trade']]
G1 = nx.from_pandas_edgelist(df1, 'reporter', 'partner', edge_attr='trade')
G1.number_of_nodes()
and so on for the next networks.
My question is: how can I use a for loop to read the files, convert them from dataframes to networks, and report the number of nodes of each network?
I tried this but nothing is reported.
import glob

for f in glob.glob('.csv'):
    df = pd.read_csv(f)
    df1 = df[['reporter', 'partner', 'trade']]
    G = nx.from_pandas_edgelist(df1, 'reporter', 'partner', edge_attr='trade')
    G.number_of_nodes()
Thanks.
Edit:
OK, so I managed to do the above with code like the below:
for files in glob.glob('/home/user/VMShared/network/2nd/*.csv'):
    df = pd.read_csv(files)
    df1 = df[['reporter', 'partner', 'import']]
    G = nx.from_pandas_edgelist(df1, 'reporter', 'partner', edge_attr='import')
    nx.write_graphml_lxml(G, "/home/user/VMShared/network/2nd/*.graphml")
The problem I now face is how to write separate files. All I get from this is one file titled *.graphml. How can I get a graphml file for every input file? Getting the same output name as the input file would be a plus.
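One way to do that (a sketch, assuming the same directory layout): derive each output name from the input file's path with os.path.splitext, so 012019.csv becomes 012019.graphml.

import glob
import os
import networkx as nx
import pandas as pd

for path in glob.glob('/home/user/VMShared/network/2nd/*.csv'):
    df = pd.read_csv(path)
    df1 = df[['reporter', 'partner', 'import']]
    G = nx.from_pandas_edgelist(df1, 'reporter', 'partner', edge_attr='import')
    # report the node count for this month's network
    print(os.path.basename(path), G.number_of_nodes())
    # swap the .csv extension for .graphml, keeping the rest of the path
    nx.write_graphml_lxml(G, os.path.splitext(path)[0] + '.graphml')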

How to construct a data frame from raw data from a CSV file

I am currently learning the Python environment to process sensor data.
I have a board with 32 sensors reading temperature. At the following link, you can find an extract of the raw data: https://5e86ea3db5a86.htmlsave.net/
I am trying to construct a data frame grouped by date from my CSV file using pandas (see a potential structure of the table: https://docs.google.com/spreadsheets/d/1zpDI7tp4nSn8-Hm3T_xd4Xz7MV6VDGcWGxwNO-8S0-s/edit?usp=sharing).
So far, I have read the data file in pandas and deleted all the unnamed columns. I am struggling with the creation of a sensor ID column, which should contain the 32 sensor IDs, and the temperature column.
How should I loop through this CSV file to create 3 columns (date, sensor ID and temperature)?
Thanks for the help.
It looks like the first item in each line is the date, then there are pairs of sensor ID and value, then a blank value that we can exclude. If so, the following should work; if not, try to modify the code to your purposes.
import pandas as pd

data = []
with open('filename.txt', 'r') as f:
    for line in f:
        # the if excludes empty strings
        parts = [part for part in line.split(',') if part]
        # this gets the date in a format that pandas can recognize;
        # you can omit the replace operations if not needed
        sensor_date = parts[0].strip().replace('[', '').replace(']', '')
        # the rest of the list are the pairings of sensor and reading
        sensor_readings = parts[1:]
        # this uses list slicing to iterate over even and odd elements in the list:
        # ::2 means every second item starting with zero, which are the evens;
        # 1::2 means every second item starting with one, which are the odds
        for sensor, reading in zip(sensor_readings[::2], sensor_readings[1::2]):
            data.append({'sensor_date': sensor_date,
                         'sensor': sensor,
                         'reading': reading})
pd.DataFrame(data)
Using your sample data, I got the following:
=== Output: ===
sensor_date sensor reading
0 Tue Jul 02 16:35:22.782 2019 28C037080B000089 16.8750
1 Tue Jul 02 16:35:22.782 2019 284846080B000062 17.0000
2 Tue Jul 02 16:35:22.782 2019 28A4BA070B00002B 16.8750
3 Tue Jul 02 16:35:22.782 2019 28D4E3070B0000D5 16.9375
4 Tue Jul 02 16:35:22.782 2019 28A21E080B00002F 17.0000
.. ... ... ...
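A small follow-up sketch: parse the two columns into proper dtypes so the frame is ready for grouping by date (the format string matches the timestamps shown above).
df = pd.DataFrame(data)
# parse the textual timestamp, e.g. "Tue Jul 02 16:35:22.782 2019"
df['sensor_date'] = pd.to_datetime(df['sensor_date'], format='%a %b %d %H:%M:%S.%f %Y')
# readings arrive as strings; convert them to floats for arithmetic
df['reading'] = df['reading'].astype(float)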

How to compute the difference in monthly income for the same id

The dataframe below shows the monthly revenue of two shops (shop_id=11, shop_id=15) over a few years:
import pandas as pd

data = {'shop_id': [11, 15, 15, 15, 11, 11],
        'month': [1, 1, 2, 3, 2, 3],
        'year': [2011, 2015, 2015, 2015, 2014, 2014],
        'revenue': [11000, 5000, 4500, 5500, 10000, 8000]}
df = pd.DataFrame(data)
df = df[['shop_id', 'month', 'year', 'revenue']]
display(df)
You can notice that shop_id=11 has only one entry in 2011 (January) and shop_id=15 has a few entries in 2015 (January, February, March). Nevertheless, it's interesting to note that the first shop has a few more entries in 2014.
I'm trying to optimize a custom function (used along with .apply()) that creates a new feature called diff_revenue; this feature shows the change in revenue from the previous month for each shop.
Some explanation of how the values in diff_revenue are generated:
The value in the first cell is 0 (red) because there is no previous information for shop_id=11;
The 2nd cell is also 0 (orange) for the same reason: there is no previous information for shop_id=15;
The 3rd cell is 500 (green), because the change from this shop's last entry (January 2015) to the current cell's revenue (February 2015) is 500 Trumps.
The 5th cell is 1000 (dark blue), because the change from this shop's last entry (January 2011) to the current cell's revenue (February 2014) was 1000 Trumps.
I'm no expert in Pandas and was wondering if the Pandas gods knew a better way. The DataFrame I have to work with is quite large (1M+ observations) and my current approach is too slow. I'm looking for a faster alternative, or maybe something more readable.
You more or less want to use Series.diff on the 'revenue' column, but you need to do a few additional things:
Sort to ensure your DataFrame is in chronological order (can undo this later)
Perform a groupby on 'shop_id' to do group level operations
Take the absolute value, since you don't want to distinguish between positive and negative
In terms of code:
# sort the values so they're in order when we perform a groupby
df = df.sort_values(by=['year', 'month'])
# perform a groupby on 'shop_id' and get the row-wise difference within each group
df['diff_revenue'] = df.groupby('shop_id')['revenue'].diff()
# fill NA as zero (no previous info), take absolute value, convert float -> int
df['diff_revenue'] = df['diff_revenue'].fillna(0).abs().astype('int')
# revert to original order
df = df.sort_index()
The resulting output:
shop_id month year revenue diff_revenue
0 11 1 2011 11000 0
1 15 1 2015 5000 0
2 15 2 2015 4500 500
3 15 3 2015 5500 1000
4 11 2 2014 10000 1000
5 11 3 2014 8000 2000
Edit
A little less straightforward solution, but maybe slightly more performant:
# sort the values so they're chronological order by shop_id
df = df.sort_values(by=['shop_id', 'year', 'month'])
# take the row-wise difference ignoring changes in shop_id
df['diff_revenue'] = df['revenue'].diff()
# zero out locations where shop_id changes (no previous info)
df.loc[df['shop_id'] != df['shop_id'].shift(), 'diff_revenue'] = 0
# Take the absolute value, convert float -> int
df['diff_revenue'] = df['diff_revenue'].abs().astype('int')
# revert to original order
df = df.sort_index()
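As a quick sanity check (a sketch using the sample frame above), both versions should produce the same diff_revenue on this data:
# expected values from the output table above, in the original row order
expected = pd.Series([0, 0, 500, 1000, 1000, 2000], name='diff_revenue')
assert (df['diff_revenue'] == expected).all()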