Loop over pandas dataframe to create multiple networks - pandas

I have data on countries' trade with one another. I have split the main file by month, giving 12 CSV files for the year 2019. A sample of the January CSV is shown below:
reporter partner year month trade
0 Albania Argentina 2019 01 515256
1 Albania Australia 2019 01 398336
2 Albania Austria 2019 01 7664503
3 Albania Bahrain 2019 01 400
4 Albania Bangladesh 2019 01 653907
5 Zimbabwe Zambia 2019 01 79569855
I want to build a complex network for every month and print the number of nodes of each network. For now I can do it the hard (stupid) way, like so:
df01 = pd.read_csv('012019.csv')
df02 = pd.read_csv('022019.csv')
df03 = pd.read_csv('032019.csv')
df1= df01[['reporter','partner', 'trade']]
df2= df02[['reporter','partner', 'trade']]
df3= df03[['reporter','partner', 'trade']]
G1 = nx.Graph()
G1 = nx.from_pandas_edgelist(df1, 'reporter', 'partner', edge_attr='trade')
G1.number_of_nodes()
and so on for the next networks.
My question is: how can I use a for loop to read the files, convert each dataframe to a network, and report the number of nodes of each network?
I tried this but nothing is reported.
for f in glob.glob('.csv'):
    df = pd.read_csv(f)
    df1 = df[['reporter','partner', 'trade']]
    G = nx.from_pandas_edgelist(df1, 'reporter', 'partner', edge_attr='trade')
    G.number_of_nodes()
Thanks.
Edit:
OK, so I managed to do the above with code like the following:
for files in glob.glob('/home/user/VMShared/network/2nd/*.csv'):
    df = pd.read_csv(files)
    df1 = df[['reporter','partner', 'import']]
    G = nx.Graph()
    G = nx.from_pandas_edgelist(df1, 'reporter', 'partner', edge_attr='import')
    nx.write_graphml_lxml(G, "/home/user/VMShared/network/2nd/*.graphml")
The problem that I now face is how to write separate files: all I get from this is a single file titled *.graphml. How can I get a .graphml file for every input file? Getting the same output name as the input file would be a plus.
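One way to get a separate, matching .graphml file per input (a sketch, assuming the directory layout above; pathlib's Path.with_suffix simply swaps .csv for .graphml):

import glob
from pathlib import Path

import networkx as nx
import pandas as pd

for csv_file in glob.glob('/home/user/VMShared/network/2nd/*.csv'):
    df = pd.read_csv(csv_file)
    df1 = df[['reporter', 'partner', 'import']]
    G = nx.from_pandas_edgelist(df1, 'reporter', 'partner', edge_attr='import')
    # report the node count for each monthly network
    print(csv_file, G.number_of_nodes())
    # name the output after the input, e.g. 012019.csv -> 012019.graphml
    nx.write_graphml_lxml(G, str(Path(csv_file).with_suffix('.graphml')))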

Related

Counting Distinct words AND average time in Pandas

I'm working on analysing some text from a Twitter API using pandas. This will eventually be visualized.
For reference, df.head() of my dataset is:
Count User Time Tweet
0 0 x 2022 ✔️Nécessité de maintien d’une filière 🇪🇺 dynam...
1 1 x 2022 Échanges approfondis à #Dakar avec le Premier ...
2 2 x 2022 ✔️Approvisionnement en #céréales & #engrai...
3 3 x 2022 Aujourd’hui à Tambacounda, à l’Est du Sénégal,...
4 4 x 2022 Working hard since 2019 to reinforce EU #auton...
I'm looking to return the distinct word count together with the average time of the tweets in which each word was used.
Right now, I've been getting the distinct word count of my dataset using df.Tweet.str.split(expand=True).stack().value_counts().
This is useful, returning:
the 1505
de 1500
to 1168
RT 931
of 906
...
africain, 1
langue 1
Félicitations! 1
Length: 18071, dtype: int64
However, I want to also analyse text usage over time.
I'm not super experienced so I'm wondering if there is a way to use a function such as df.groupby() to sort this result by time? Or, is there a way to modify my original function to add a column to my results that includes average time?
I would use str.extractall to get the words, join the Time, then perform a groupby.value_counts to get the count per Year:
out = (df['Tweet']
       .str.extractall(r'(\S+)')
       .droplevel('match')
       .join(df['Time'])
       .groupby('Time')[0].value_counts()
      )
NB: if you want to exclude non-letter/digit characters from the words, use r'(\w+)' in place of r'(\S+)'.
Output:
Time 0
2022 à 3
#Dakar 1
#auton... 1
#céréales 1
#engrai... 1
& 1
... 1
...
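If you also want the average time per word rather than counts per year, a similar pipeline works; this is a sketch beyond the original answer, assuming Time is numeric (e.g. a year):

# mean Time per distinct word (column 0 holds the extracted words)
avg_time = (df['Tweet']
            .str.extractall(r'(\S+)')
            .droplevel('match')
            .join(df['Time'])
            .groupby(0)['Time']
            .mean()
           )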

Pandas Groupby Multiple Conditions KeyError

I have a df called df_out with column names such as those shown below, but for some reason I cannot use the groupby function with the column headers, since it keeps giving me KeyError: 'year'. I've researched and tried stripping whitespace, resetting the index, allowing whitespace before my groupby call, etc., and I cannot get past this KeyError. df_out looks like this:
df_out.columns
Out[185]:
Index(['year', 'month', 'BARTON CHAPEL', 'BARTON I', 'BIG HORN I',
'BLUE CREEK', 'BUFFALO RIDGE I', 'CAYUGA RIDGE', 'COLORADO GREEN',
'DESERT WIND', 'DRY LAKE I', 'EL CABO', 'GROTON', 'NEW HARVEST',
'PENASCAL I', 'RUGBY', 'TULE'],
dtype='object', name='plant_name')
But when I use df_out.head(), I get a different view with a leading 'plant_name' label, so maybe this is where the error comes from, or is related. Here is the output from:
df_out.head()
Out[187]:
plant_name year month BARTON CHAPEL BARTON I BIG HORN I BLUE CREEK \
0 1991 1 6.432285 7.324126 5.170067 6.736384
1 1991 2 7.121324 6.973586 4.922693 7.473527
2 1991 3 8.125793 8.681317 5.796599 8.401855
3 1991 4 7.454972 8.037764 7.272292 7.961625
4 1991 5 7.012809 6.530013 6.626949 6.009825
plant_name BUFFALO RIDGE I CAYUGA RIDGE COLORADO GREEN DESERT WIND \
0 7.163790 7.145323 5.783629 5.682003
1 7.595744 7.724717 6.245952 6.269524
2 8.111411 9.626075 7.918871 6.657648
3 8.807458 8.618806 7.011444 5.848736
4 7.734852 6.267097 7.410013 5.099610
plant_name DRY LAKE I EL CABO GROTON NEW HARVEST PENASCAL I \
0 4.721089 10.747285 7.456640 6.921801 6.296425
1 5.095923 8.891057 7.239762 7.449122 6.484241
2 8.409637 12.238508 8.274046 8.824758 8.444960
3 7.893694 10.837139 6.381736 8.840431 7.282444
4 8.496976 8.636882 6.856747 7.469825 7.999530
plant_name RUGBY TULE
0 7.028360 4.110605
1 6.394687 5.257128
2 6.859462 10.789516
3 7.590153 7.425153
4 7.556546 8.085255
My groupby statement that raises the KeyError looks like this; I'm trying to calculate the average by year and month over the subset of df_out's columns found in the list west:
west=['BIG HORN I','DRY LAKE I', 'TULE']
westavg = df_out[df_out.columns[df_out.columns.isin(west)]].groupby(['year','month']).mean()
thank you very much,
Your code can be broken down as:
westavg = (df_out[df_out.columns[df_out.columns.isin(west)]]
           .groupby(['year','month']).mean()
          )
which is not working because ['year','month'] are not columns of df_out[df_out.columns[df_out.columns.isin(west)]].
Try:
west_cols = [c for c in df_out if c in west]
westavg = df_out.groupby(['year','month'])[west_cols].mean()
OK, with the help of Quang Hoang below, I understood the problem and came up with this working answer, which I find a bit easier to follow, using .intersection:
westavg = df_out[df_out.columns.intersection(west)].mean(axis=1)
# gives the average of each row over the subset of columns defined by the list 'west'
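For display purposes, a small sketch (assuming the column names above) that keeps year and month alongside that per-row average:

west = ['BIG HORN I', 'DRY LAKE I', 'TULE']
# per-row average over the 'west' plants, with year/month kept for context
result = df_out[['year', 'month']].copy()
result['west_avg'] = df_out[df_out.columns.intersection(west)].mean(axis=1)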

Treat Header to Data in Dataframe

I am using a package to read in a table from a PDF. The source table is badly formed, so I have a series of inconsistently formatted tables to clean on the back end (so reading with header=None is not an option). The first row, which is data, is being treated as a header. How can I get that first row treated as a data row so I can add a proper header? (Output below is truncated, as it has numerous columns.)
**Asia Afghanistan 35,939**
0 Asia Bahrain 972
1 Asia Bhutan 1,910
2 Asia Brunei 111
3 Asia Burma 20,078
4 Asia Cambodia 179,662
The goal is for the "Afghanistan" header row to drop to index 0, and then to label the columns Continent, Country, Total.
Thanks in advance; this has driven me nuts.
Note: in response to the request for actual code, see below; the issue is in tables[1].
import pandas as pd
import tabula
file = "https://travel.state.gov/content/dam/visas/Diversity-Visa/DVStatistics/DV-applicant-entrants-by-country-2019-2021.pdf"
tables = tabula.read_pdf(file, pages = "all", multiple_tables = True)
file = "https://travel.state.gov/content/dam/visas/Diversity-Visa/DVStatistics/DV-applicant-entrants-by-country-2019-2021.pdf"
tables = tabula.read_pdf(file, pages = "all", multiple_tables = True)
tables[1].head()
# note: I tried zip, but this only creates a multilevel header, not the desired effect of pushing the current header down as data and adding a new header
ColumnNames = ['Region','Foreign State of Chargeability','FY 2019 Entrants','FY 2019 Derivatives','FY 2019 Total','FY 2020 Entrants','FY 2020 Derivatives','FY 2020 Total','FY 2021 Entrants','FY 2021 Derivatives','FY 2021 Total']
tables[1].columns = pd.MultiIndex.from_tuples(
    zip(ColumnNames, tables[1].columns))
tables[1].reset_index(0)
tables[1].head()
OK, I got it; perhaps not the most elegant solution. I created a one-column data frame from the "column labels" (which were actually data), transposed it into a single row, gave it the real column labels, assigned the real labels to the original frame (overwriting the misread header), then concatenated the two:
ColumnNames = ['Region','Foreign State of Chargeability','FY 2019 Entrants','FY 2019 Derivatives','FY 2019 Total','FY 2020 Entrants','FY 2020 Derivatives','FY 2020 Total','FY 2021 Entrants','FY 2021 Derivatives','FY 2021 Total']
First_Row = tables[1].columns.values.tolist()
# make a one-column dataframe and transpose it into a single data row
dTemp = pd.DataFrame(First_Row)
dTemp = dTemp.transpose()
dTemp.columns = ColumnNames
# we can now overwrite the misread header with the real column labels
tables[1].columns = ColumnNames
tables[1] = pd.concat([dTemp, tables[1]], axis=0, ignore_index=True)
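A more compact variant of the same idea (a sketch, assuming the structure above): build the one-row frame directly from the header list, so the transpose step is not needed.

# push the misread header down as a data row in fewer steps
header_row = pd.DataFrame([tables[1].columns.tolist()], columns=ColumnNames)
tables[1].columns = ColumnNames
tables[1] = pd.concat([header_row, tables[1]], ignore_index=True)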

Averaging dataframes with many string columns and displaying back all the columns

I have struggled with this even after looking at various past answers, to no avail.
My data consists of numeric and non-numeric columns. I'd like to average the numeric columns and display my data in the GUI together with the information in the non-numeric columns. The non-numeric columns have info such as names, roll number, and stream, while the numeric columns contain students' marks for various subjects. It works well when dealing with one dataframe, but fails when I combine two or more dataframes: it returns only the average of the numeric columns and displays that, leaving the non-numeric columns undisplayed. Below is one of the codes I've tried so far.
df = pd.concat((df3, df5))
dfs = df.groupby(df.index, level=0).mean()
headers = list(dfs)
self.marks_table.setRowCount(dfs.shape[0])
self.marks_table.setColumnCount(dfs.shape[1])
self.marks_table.setHorizontalHeaderLabels(headers)
df_array = dfs.values
for row in range(dfs.shape[0]):
    for col in range(dfs.shape[1]):
        self.marks_table.setItem(row, col, QTableWidgetItem(str(df_array[row, col])))
A working code should return averages in something like this
STREAM ADM NAME KCPE ENG KIS
0 EAGLE 663 FLOYCE ATI 250 43 5
1 EAGLE 664 VERONICA 252 32 33
2 EAGLE 665 MACREEN A 341 23 23
3 EAGLE 666 BRIDGIT 286 23 2
Rather than
ADM KCPE ENG KIS
0 663.0 250.0 27.5 18.5
1 664.0 252.0 26.5 33.0
2 665.0 341.0 17.5 22.5
3 666.0 286.0 38.5 23.5
Sample data
Df1 = pd.DataFrame({
    'STREAM':[NORTH,SOUTH],
    'ADM':[437,238,439],
    'NAME':[JAMES,MARK,PETER],
    'KCPE':[233,168,349],
    'ENG':[70,28,79],
    'KIS':[37,82,79],
    'MAT':[67,38,29]})
Df2 = pd.DataFrame({
    'STREAM':[NORTH,SOUTH],
    'ADM':[437,238,439],
    'NAME':[JAMES,MARK,PETER],
    'KCPE':[233,168,349],
    'ENG':[40,12,56],
    'KIS':[33,43,43],
    'MAT':[22,58,23]})
Your question is not clear; however, guessing the intent of the question from its content, I have modified your dataframes (which were not well formed) by adding a stream called 'CENTRAL'; see:
Df1 = pd.DataFrame({'STREAM':['NORTH','SOUTH', 'CENTRAL'],'ADM':[437,238,439], 'NAME':['JAMES','MARK','PETER'],'KCPE':[233,168,349],'ENG':[70,28,79],'KIS':[37,82,79],'MAT':[67,38,29]})
Df2 = pd.DataFrame({ 'STREAM':['NORTH','SOUTH','CENTRAL'],'ADM':[437,238,439], 'NAME':['JAMES','MARK','PETER'],'KCPE':[233,168,349],'ENG':[40,12,56],'KIS':[33,43,43],'MAT':[22,58,23]})
I have assumed you want to merge the two dataframes and find the average:
df3 = pd.concat([Df1, Df2])
df3.groupby(['STREAM','ADM','NAME'], as_index=False).mean()
Outcome: the per-group averages, with STREAM, ADM and NAME preserved as columns.
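For the GUI side of the question, the key point is that as_index=False keeps STREAM, ADM and NAME as regular columns, so the existing display loop shows them next to the averages. A self-contained sketch using the corrected frames above:

import pandas as pd

Df1 = pd.DataFrame({'STREAM': ['NORTH', 'SOUTH', 'CENTRAL'], 'ADM': [437, 238, 439],
                    'NAME': ['JAMES', 'MARK', 'PETER'], 'KCPE': [233, 168, 349],
                    'ENG': [70, 28, 79], 'KIS': [37, 82, 79], 'MAT': [67, 38, 29]})
Df2 = pd.DataFrame({'STREAM': ['NORTH', 'SOUTH', 'CENTRAL'], 'ADM': [437, 238, 439],
                    'NAME': ['JAMES', 'MARK', 'PETER'], 'KCPE': [233, 168, 349],
                    'ENG': [40, 12, 56], 'KIS': [33, 43, 43], 'MAT': [22, 58, 23]})

# group on the identifier columns; as_index=False keeps them as columns
avg = pd.concat([Df1, Df2]).groupby(['STREAM', 'ADM', 'NAME'], as_index=False).mean()
print(avg)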

How to construct a data frame from raw data from a CSV file

I am currently learning the Python environment to process sensor data.
I have a board with 32 sensors reading temperature. At the following link, you can find an extract of the raw data: https://5e86ea3db5a86.htmlsave.net/
I am trying to construct a data frame grouped by date from my CSV file using pandas (see the potential structure of the table: https://docs.google.com/spreadsheets/d/1zpDI7tp4nSn8-Hm3T_xd4Xz7MV6VDGcWGxwNO-8S0-s/edit?usp=sharing).
So far, I have read the data file in pandas and deleted all the unnamed columns. I am struggling with the creation of a sensor ID column, which should contain the 32 sensor IDs, and the temperature column.
How should I loop through this CSV file to create 3 columns (date, sensor ID and temperature)?
Thanks for the help
It looks like the first item in each line is the date, then there are pairs of sensor id and value, then a blank value that we can exclude. If so, then the following should work. If not, try to modify the code to your purposes.
data = []
with open('filename.txt', 'r') as f:
    for line in f:
        # the if excludes empty strings
        parts = [part for part in line.split(',') if part]
        # this gets the date in a format that pandas can recognize
        # you can omit the replace operations if not needed
        sensor_date = parts[0].strip().replace('[', '').replace(']', '')
        # the rest of the list are the pairings of sensor and reading
        sensor_readings = parts[1:]
        # this uses list slicing to iterate over even and odd elements in the list
        # ::2 means every second item starting with zero (the evens)
        # 1::2 means every second item starting with one (the odds)
        for sensor, reading in zip(sensor_readings[::2], sensor_readings[1::2]):
            data.append({'sensor_date': sensor_date,
                         'sensor': sensor,
                         'reading': reading})
pd.DataFrame(data)
Using your sample data, I got the following:
=== Output: ===
Out[64]:
sensor_date sensor reading
0 Tue Jul 02 16:35:22.782 2019 28C037080B000089 16.8750
1 Tue Jul 02 16:35:22.782 2019 284846080B000062 17.0000
2 Tue Jul 02 16:35:22.782 2019 28A4BA070B00002B 16.8750
3 Tue Jul 02 16:35:22.782 2019 28D4E3070B0000D5 16.9375
4 Tue Jul 02 16:35:22.782 2019 28A21E080B00002F 17.0000
.. ... ... ...
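As a follow-up sketch (assuming the output above): the reading column is still text and the date is an unparsed string; pandas can convert both for numeric and time-based work:

df = pd.DataFrame(data)
# readings arrive as strings; convert to floats
df['reading'] = df['reading'].astype(float)
# parse dates like 'Tue Jul 02 16:35:22.782 2019'
df['sensor_date'] = pd.to_datetime(df['sensor_date'],
                                   format='%a %b %d %H:%M:%S.%f %Y')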