Python Pandas: Append column names in each row - pandas

Is there a way to append column names in dataframe rows?
input:
cv cv mg mg
5g 5g 0% zinsenzin
output:
cv cv col_name mg mg col_name
5g 5g cv 0% zinsenzin mg
I tried this, but it's not working:
list_col = list(df)
for i in list_col:
    if i != i.shift(1)
        df['new_col'] = i
I got stuck here and can't find any solution.

Working with duplicated column names in pandas is not easy, but it is possible:
import pandas as pd
c = 'cv cv mg mg sa sa ta ta at at ad ad an av av ar ar ai ai ca ca ch ch ks ks ct ct ce ce cw cw dt dt fr fr fs fs fm fm it it lg lg mk mk md md mt mt ob ob ph ph pb pb rt rt sz sz tg tg tt tt vv vv yq yq fr fr ms ms lp lp ts ts mv mv'.split()
df = pd.DataFrame([range(77)], columns=c)
print (df)
cv cv mg mg sa sa ta ta at at ... fr fr ms ms lp lp ts \
0 0 1 2 3 4 5 6 7 8 9 ... 67 68 69 70 71 72 73
ts mv mv
0 74 75 76
[1 rows x 77 columns]
df = pd.concat([v.assign(new_col=k) for k, v in df.groupby(axis=1,level=0,sort=False)],axis=1)
print (df)
cv cv new_col mg mg new_col sa sa new_col ta ... new_col lp lp \
0 0 1 cv 2 3 mg 4 5 sa 6 ... ms 71 72
new_col ts ts new_col mv mv new_col
0 lp 73 74 ts 75 76 mv
[1 rows x 115 columns]
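Newer pandas releases deprecate `groupby(axis=1)`, so the one-liner above may warn or fail there. A minimal sketch of the same idea without a column-wise groupby, using the four-column frame from the question (the names `parts`, `block` and `result` are only illustrative):

```python
import pandas as pd

# Frame from the question: duplicated column names, one row of values.
df = pd.DataFrame([["5g", "5g", "0%", "zinsenzin"]],
                  columns=["cv", "cv", "mg", "mg"])

parts = []
# Index.unique() keeps the names in order of first appearance.
for name in df.columns.unique():
    # A boolean mask selects every column carrying this name, duplicates included.
    block = df.loc[:, df.columns == name]
    # Append the column name as a constant column after the block.
    parts.append(block.assign(col_name=name))

result = pd.concat(parts, axis=1)
print(result)
```

As in the groupby version, columns that share a name are gathered into one block even if they are not adjacent in the original frame.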


Edit binary data in PDF with SED / BBE (change colors in a PDF)

I want to change some background colors in a batch of PDFs.
I found out that the color information is stored in the first stream - endstream block,
in a format like 1 1 1 sc, which in this example represents white (#FFFFFF).
Here is an example after I decode the binary stream with
qpdf --qdf --object-streams=disable IN.pdf OUT.pdf
stream
q Q q /Cs1 cs 0.9686275 0.9725490 0.9764706 sc 0 12777 m 600 12777 l 600 0
l 0 0 l h f 0 12777 m 600 12777 l 600 0 l 0 0 l h f ➡️1 1 1 sc⬅️ 0 12575 m 600
12575 l 600 12308 l 0 12308 l h f 0.1254902 0.2666667 0.3921569 sc 0 872 m
600 872 l 600 462 l 0 462 l h f 0 462 m 600 462 l 600 0 l 0 0 l h f ➡️1 1 1
sc⬅️ 0 12297 m 600 12297 l 600 5122 l 0 5122 l h f 0.7411765 0.8980392 0.9725490
sc 23 7249 m 577 7249 l 577 6007 l 23 6007 l h f 1 0.9215686 0.9333333 sc
23 5848 m 577 5848 l 577 5533 l 23 5533 l h f 0.9686275 0.9725490 0.9764706
sc 23 5510 m 577 5510 l 577 5156 l 23 5156 l h f ➡️1 1 1 sc⬅️ 0 5110 m 600 5110
...
endstream
If I open the PDF in TextEdit and manually replace 1 1 1 sc with 0 1 0 sc my white background immediately changes to green after saving the PDF file.
How can I do this in an automated way with a Text Tool?
sed 's/1 1 1 sc/0 1 0 sc/' IN.pdf > OUT.pdf
gives me the error: sed: RE error: illegal byte sequence
bbe -e 's/0 1 1 sc/0 1 0 sc/' IN.pdf > OUT.pdf
no errors, OUT.pdf is written but no colors have changed
echo 'hello 1 1 1 sc world' | bbe -e 's/1 1 1 sc/0 1 0 sc/'
seems to work fine...
In the above stream (the first stream block) in the 1-page PDF file I need to replace only the second and third matches. Could the problem be that the second one contains a line break?
It is not completely clear what you are doing.
You mention commands:
qpdf --qdf --object-streams=disable IN.pdf OUT.pdf
sed 's/1 1 1 sc/0 1 0 sc/' IN.pdf > OUT.pdf
bbe -e 's/0 1 1 sc/0 1 0 sc/' IN.pdf > OUT.pdf
It is not obvious if IN.pdf in the sed or bbe commands is the same IN.pdf file as the qpdf command.
If all three commands use the same file as input, that can explain why bbe fails: in the original file the content stream is still compressed, so the literal text 1 1 1 sc does not appear in it.
Another possibility is that the bbe command shown is the command you are actually using and not a typo. It searches for 0 1 1 sc, not for the string 1 1 1 sc.
sed is not designed to work with binary data.
Although the GNU implementation has a non-standard -z option to help read binary files, it still works on a form of "lines". Perl can be used as an improved sed here.
To change only the first three instances of the string 1 1 1 sc in the file, you could try:
qpdf --qdf --object-streams=disable IN.pdf - |\
perl -0777 -pe 'for $i (1..3) { s/1 1 1 sc/0 1 0 sc/ }' |\
qpdf - OUT.pdf
In this Perl command:
-0777 - treat entire input as single record
-pe - run command on each record, then print (like sed)
for $i (1..3) { ... } - run three times
s/.../.../ - similar to sed's s/// command
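The same "replace only the first N matches" idea can be sketched in Python on raw bytes. This is an illustration, not part of the answer above: the helper name `replace_first_n` is hypothetical, and it assumes the data has already been decoded with qpdf's QDF mode. Unlike the literal pattern in the Perl one-liner, it allows any whitespace (including a line break) between the operands and keeps that whitespace unchanged, so the byte length of the data does not change.

```python
import re

# "1 1 1 sc" with arbitrary whitespace between tokens; the look-behind
# avoids matching the tail of a longer number such as "0.1 1 1 sc".
WHITE_SC = re.compile(rb"(?<![\d.])1(\s+)1(\s+)1(\s+)sc\b")


def replace_first_n(data: bytes, n: int) -> bytes:
    """Turn the first n white fills (1 1 1 sc) into green (0 1 0 sc)."""
    # The captured whitespace is written back as-is, so a line break
    # inside a match survives the substitution.
    return WHITE_SC.sub(rb"0\g<1>1\g<2>0\g<3>sc", data, count=n)


sample = b"a 1 1 1 sc b 1 1 1\nsc c 1 1 1 sc d 1 1 1 sc"
print(replace_first_n(sample, 3))
```

Because the pattern works on bytes, no text decoding of the PDF is needed, which sidesteps the "illegal byte sequence" error that sed reported.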
I think I will tackle this task with PikePDF, a Python library which seems to be able to work with content streams: https://pikepdf.readthedocs.io/en/latest/topics/content_streams.html
I was just able to Pretty Print the content streams by using:
#!/usr/bin/env python
import pikepdf

with pikepdf.open('IN.pdf') as pdf:
    page = pdf.pages[0]
    # parse the page's content stream into (operands, operator) instructions
    instructions = pikepdf.parse_content_stream(page)
    # serialize the instructions back to bytes, one instruction per line
    data = pikepdf.unparse_content_stream(instructions)
    print(data.decode('ascii'))
Now working my way toward actually editing the content stream.
Here is the stream fragment from my question, pretty-printed:
q
Q
q
/Cs1 cs
0.9686275 0.9725490 0.9764706 sc
0 12777 m
600 12777 l
600 0 l
0 0 l
h
f
0 12777 m
600 12777 l
600 0 l
0 0 l
h
f
➡️1 1 1 sc⬅️
0 12575 m
600 12575 l
600 12308 l
0 12308 l
h
f
0.1254902 0.2666667 0.3921569 sc
0 872 m
600 872 l
600 462 l
0 462 l
h
f
0 462 m
600 462 l
600 0 l
0 0 l
h
f
➡️1 1 1 sc⬅️
0 12297 m
600 12297 l
600 5122 l
0 5122 l
h
f
0.7411765 0.8980392 0.9725490 sc
23 7249 m
577 7249 l
577 6007 l
23 6007 l
h
f
1 0.9215686 0.9333333 sc
23 5848 m
577 5848 l
577 5533 l
23 5533 l
h
f
0.9686275 0.9725490 0.9764706 sc
23 5510 m
577 5510 l
577 5156 l
23 5156 l
h
f
➡️1 1 1 sc⬅️
0 5110 m
600 5110
Some more info about the color value:
Just divide the RGB values by 255
for example:
DeepSkyBlue = #00bfff = RGB(0, 191, 255)
0/255 = 0
191/255 = 0.7490196
255/255 = 1
0 0.7490196 1 sc
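The division by 255 can be wrapped in a small helper. This is a sketch only; the function name `hex_to_sc` is made up for illustration, and it assumes a six-digit hex color and an RGB color space, as in the example above.

```python
def hex_to_sc(hex_color: str) -> str:
    """Convert a color such as '#00bfff' into PDF 'r g b sc' operands."""
    digits = hex_color.lstrip("#")
    # Each pair of hex digits is one 0-255 channel; PDF expects 0-1 values.
    channels = [int(digits[i:i + 2], 16) / 255 for i in (0, 2, 4)]
    # Seven significant digits match the precision seen in the stream above.
    return " ".join(f"{value:.7g}" for value in channels) + " sc"


print(hex_to_sc("#00bfff"))  # DeepSkyBlue
print(hex_to_sc("#FFFFFF"))  # white
```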

Pandas - Groupby by three columns with cumsum or cumcount [duplicate]

I need to create a new "identifier column" with unique values for each combination of values of two columns. For example, the same "identifier" should be used when ID and phase are the same (e.g. r1 and ph1 [but a new, unique value should be added to the column when r1 and ph2])
df
ID phase side values
r1 ph1 l 12
r1 ph1 r 34
r1 ph2 l 93
s4 ph3 l 21
s3 ph2 l 88
s3 ph2 r 54
...
I would need a new column (idx) like so:
new_df
ID phase side values idx
r1 ph1 l 12 1
r1 ph1 r 34 1
r1 ph2 l 93 2
s4 ph3 l 21 3
s3 ph2 l 88 4
s3 ph2 r 54 4
...
I've tried applying code from this question but could not find a way to increment the values in idx.
Try with groupby ngroup + 1, use sort=False to ensure groups are enumerated in the order they appear in the DataFrame:
df['idx'] = df.groupby(['ID', 'phase'], sort=False).ngroup() + 1
df:
ID phase side values idx
0 r1 ph1 l 12 1
1 r1 ph1 r 34 1
2 r1 ph2 l 93 2
3 s4 ph3 l 21 3
4 s3 ph2 l 88 4
5 s3 ph2 r 54 4
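For reference, a self-contained version of the above, rebuilt from the sample rows in the question. The second column, `idx_sorted`, is added only to illustrate what changes when `sort=False` is left out:

```python
import pandas as pd

df = pd.DataFrame({
    "ID":     ["r1", "r1", "r1", "s4", "s3", "s3"],
    "phase":  ["ph1", "ph1", "ph2", "ph3", "ph2", "ph2"],
    "side":   ["l", "r", "l", "l", "l", "r"],
    "values": [12, 34, 93, 21, 88, 54],
})

# sort=False numbers the groups in order of first appearance.
df["idx"] = df.groupby(["ID", "phase"], sort=False).ngroup() + 1

# With the default sort=True the groups are numbered in sorted key order,
# so (s3, ph2) gets a lower number than (s4, ph3).
df["idx_sorted"] = df.groupby(["ID", "phase"]).ngroup() + 1

print(df)
```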

How to plot distribution of final grade by sex?

I'm working on predicting student performance based on various different factors. This is a link to my data: https://archive.ics.uci.edu/ml/datasets/Student+Performance#. This is a sample of the observations from the sex and final grade data columns:
sex G3
F 6
F 6
F 10
F 15
F 10
M 15
M 11
F 6
M 19
M 15
F 9
F 12
M 14
I'm looking at the distribution of my target variable (final grade):
ax= sns.kdeplot(data=df2, x="G3", shade=True)
ax.set(xlabel= 'Final Grade', ylabel= 'Density', title= 'Distribution of Final Grade')
plt.xlim([0, 20])
plt.show()
Screenshot of Distribution of Final Grade
And now I want to find out how the distribution of final grades differs by sex.
How can I do this?
Consider the sample data:
import pandas as pd
import seaborn as sns
df2 = pd.DataFrame({'sex': ['F','F','F','F','F','M','M','F','M','M','F','F','M'], 'grades': [6,6,10,15,10,15,11,6,19,15,9,12,14]})
sex grades
F 6
F 6
F 10
F 15
F 10
M 15
M 11
F 6
M 19
M 15
F 9
F 12
M 14
We use the seaborn countplot function as follows.
sns.countplot(x="grades", hue='sex', data=df2)
To get the following plot.

Extracting and Parsing Table from HTML using VBA

I am using Microsoft Office Version 1703.
I have been tasked with:
Creating a weekly Excel sheet using data from AccuWeather Professional for 10 specific locations and have that updated weekly.
Creating historical data going back 4 or 5 years for the same multiple locations. Ideally I'd like to take the time to automate this as it has been considered a long term project.
The original process for doing this used Text to Columns in Excel. With Text to Columns the data is imported as an array, and I have to use space as a delimiter to break it into columns and rows correctly before finally entering it by hand into the presentation sheet.
There is a picture of the accuweather site and the information I'm attempting to grab:
When simply copying and pasting the data I receive this as an array for example:
TODAY'S DATE: 2-JUN-17
JUN-17 FOR Monticello White County Airp, IN (676') LAT=40.7N LON= 86.8W
TEMPERATURE PRECIPITATION
ACTUAL NORMAL
HI LO AVG HI LO AVG DEPT AMNT SNOW SNCVR HDD
1 81 48 65 78 55 66 -1 0.00 0.0e 0 0
2 M M M 78 55 67 M M 0.0 0 M
3 M M M 78 56 67 M M 0.0 0 M
4 M M M 79 56 67 M M 0.0 0 M
5 M M M 79 56 68 M M 0.0 0 M
6 M M M 79 57 68 M M 0.0 0 M
7 M M M 79 57 68 M M 0.0 0 M
8 M M M 80 57 69 M M 0.0 0 M
9 M M M 80 58 69 M M 0.0 0 M
10 M M M 80 58 69 M M 0.0 0 M
11 M M M 80 58 69 M M 0.0 0 M
12 M M M 81 58 70 M M 0.0 0 M
13 M M M 81 59 70 M M 0.0 0 M
14 M M M 81 59 70 M M 0.0 0 M
15 M M M 81 59 70 M M 0.0 0 M
16 M M M 81 59 70 M M 0.0 0 M
17 M M M 82 60 71 M M 0.0 0 M
18 M M M 82 60 71 M M 0.0 0 M
19 M M M 82 60 71 M M 0.0 0 M
20 M M M 82 60 71 M M 0.0 0 M
21 M M M 82 60 71 M M 0.0 0 M
22 M M M 82 61 72 M M 0.0 0 M
23 M M M 83 61 72 M M 0.0 0 M
24 M M M 83 61 72 M M 0.0 0 M
25 M M M 83 61 72 M M 0.0 0 M
26 M M M 83 61 72 M M 0.0 0 M
27 M M M 83 61 72 M M 0.0 0 M
28 M M M 83 61 72 M M 0.0 0 M
29 M M M 83 62 73 M M 0.0 0 M
30 M M M 84 62 73 M M 0.0 0 M
TOTALS FOR KMCX
HIGHEST TEMPERATURE 81 TOTAL PRECIP 0.00
LOWEST TEMPERATURE 48 TOTAL SNOWFALL 0.0
AVERAGE TEMPERATURE 64.5 NORMAL PRECIP 4.08
DEPARTURE FROM NORM -2.0 % OF NORMAL PRECIP 0
HEATING DEGREE DAYS 0
NORMAL DEGREE DAYS 0
shows up like this:
The HTML selector is:
body > center > table > tbody > tr > td.pageContent > table > tbody > tr:nth-child(2) > td > table > tbody > tr:nth-child(1) > td > font > table:nth-child(5) > tbody > tr > td > pre
The issue with doing a Web Query is that even if I have Internet Explorer save my password, it will not log in from Web Query. I managed to frankenstein a VBA script that opens I.E., logs in successfully, and navigates to the intended page. I imagine I could create individual scripts in a sequence to grab the weather data for each specific location fairly easily. The problem I'm having is writing a VBA script that grabs only what is inside the <pre> element I referenced above. Right now the script selects everything, copies it, and pastes it into my sheet.
What I would ideally like to accomplish is: navigate to AccuWeather Pro, log in successfully, pull up historical data for a specific location, grab all the data referenced above, import it into Excel, and format it to my presentation sheet automatically. It'd be even nicer if I could get it to update automatically at least weekly.
Here is my VBA code:
Sub Test()
    ' READYSTATE_COMPLETE is not defined under late binding (CreateObject), so declare it here
    Const READYSTATE_COMPLETE As Long = 4
    Dim ieApp As Object
    Dim ieDoc As Object
    Sheets("Sheet1").Select
    Range("A1:A1000") = "" ' erase previous data
    Range("A1").Select
    Set ieApp = CreateObject("InternetExplorer.Application")
    With ieApp
        .Visible = True
        .Navigate "https://wwwl.accuweather.com/error.php?url=proa.accuweather.com/adcbin/professional/forecast_local.asp?zipcode=47960&mt=pro"
        Do While .Busy: DoEvents: Loop
        Do Until .ReadyState = READYSTATE_COMPLETE: DoEvents: Loop
        Set ieDoc = .Document
        ' fill in the login form – View Source from your browser to get the control names
        With ieDoc.forms(0)
            .UserName.Value = "username"
            .Password.Value = "password"
            .Submit
        End With
        Do While .Busy: DoEvents: Loop
        Do Until .ReadyState = READYSTATE_COMPLETE: DoEvents: Loop
        ' now that we’re in, go to the page we want
        .Visible = True
        .Navigate "http://proa.accuweather.com/adcbin/professional/historical_index.asp"
        Do While .Busy: DoEvents: Loop
        Do Until .ReadyState = READYSTATE_COMPLETE: DoEvents: Loop
        .ExecWB 17, 0 ' // SelectAll
        .ExecWB 12, 2 ' // Copy selection
        ActiveSheet.PasteSpecial Format:="Text", link:=False, DisplayAsIcon:=False
        Range("A1").Select
        .Quit
        .Quit ' just to make sure
    End With
End Sub
I did my best to be as thorough, accurate, and correct with my question as possible, I apologize if I've committed any stack exchange social faux pas etc.
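The extraction step described above (keep only the text inside the `<pre>` element, then split the daily rows into columns) can be sketched outside of VBA. The following is an illustration in Python using only the standard library, not a drop-in replacement for the macro: the class name, function name and column labels are made up here, and the sketch assumes the page HTML has already been downloaded after login.

```python
from html.parser import HTMLParser


class PreExtractor(HTMLParser):
    """Collect the text found inside <pre> elements."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "pre":
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == "pre" and self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.chunks.append(data)


# Hypothetical labels for the 12 whitespace-separated fields of a daily row.
COLUMNS = ["day", "hi", "lo", "avg", "norm_hi", "norm_lo", "norm_avg",
           "dept", "amnt", "snow", "sncvr", "hdd"]


def parse_daily_rows(text):
    """Return one dict per daily row; 'M' (missing) becomes None."""
    rows = []
    for line in text.splitlines():
        fields = line.split()
        # Daily rows start with the day number and have exactly 12 fields.
        if len(fields) == len(COLUMNS) and fields[0].isdigit():
            values = [None if f == "M" else f for f in fields]
            rows.append(dict(zip(COLUMNS, values)))
    return rows


html = ("<table><tr><td><pre>TODAY'S DATE: 2-JUN-17\n"
        " 1 81 48 65 78 55 66 -1 0.00 0.0e 0 0\n"
        " 2 M M M 78 55 67 M M 0.0 0 M\n"
        "TOTALS FOR KMCX</pre></td></tr></table>")
parser = PreExtractor()
parser.feed(html)
parser.close()
rows = parse_daily_rows("".join(parser.chunks))
print(rows)
```

Values are kept as strings because some carry flags (such as `0.0e`) whose meaning the report itself does not explain.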

inserting an empty line in between every two elements a column (data frame + pandas)

My data frame looks something like this:
Games
0 CAR 20
1 DEN 21
2 TB 31
3 ATL 24
4 SD 27
5 KC 33
6 CIN 23
7 NYJ 22
import pandas as pd
df = pd.read_csv('weekone.txt')
df.columns=['Games']
I'm trying to put a blank line in between every two elements (teams).
So I want it to look like this:
Games
0 CAR 20
1 DEN 21

2 TB 31
3 ATL 24

4 SD 27
5 KC 33

6 CIN 23
7 NYJ 22
But when I'm using this loop
for i in df2.index:
    if (df2.index[i])%2 == 1:
        df2.Games[i]=df2.Games[i]+('\n')
    else:
        df2.Games[i] = df2.Games[i]
I'm getting an output like this:
Games
0 CAR 20
1 DEN 21\n
2 TB 31
3 ATL 24\n
4 SD 27
5 KC 33\n
6 CIN 23
7 NYJ 22\n
What am I doing wrong? Thanks.
You can do it this way:
In [172]: x
Out[172]:
Games
0 CAR 20
1 DEN 21
2 TB 31
3 ATL 24
4 SD 27
5 KC 33
6 CIN 23
7 NYJ 22
In [173]: %paste
empty_line = pd.DataFrame([''], columns=x.columns, index=[''])
rslt = x.loc[:1]
g = x.groupby(x.index//2)
for i in range(1, len(g)):
    # DataFrame.append was removed in pandas 2.0; pd.concat does the same job
    rslt = pd.concat([rslt, empty_line, g.get_group(i)])
## -- End pasted text --
In [174]: rslt
Out[174]:
Games
0 CAR 20
1 DEN 21

2 TB 31
3 ATL 24

4 SD 27
5 KC 33

6 CIN 23
7 NYJ 22
the index's dtype is object now:
In [178]: rslt.index.dtype
Out[178]: dtype('O')
or having -1 as an index for empty lines:
In [175]: %paste
empty_line = pd.DataFrame([''], columns=x.columns, index=[-1])
rslt = x.loc[:1]
g = x.groupby(x.index//2)
for i in range(1, len(g)):
    # DataFrame.append was removed in pandas 2.0; pd.concat does the same job
    rslt = pd.concat([rslt, empty_line, g.get_group(i)])
## -- End pasted text --
In [176]: rslt
Out[176]:
Games
0 CAR 20
1 DEN 21
-1
2 TB 31
3 ATL 24
-1
4 SD 27
5 KC 33
-1
6 CIN 23
7 NYJ 22
index dtype:
In [181]: rslt.index.dtype
Out[181]: dtype('int64')
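A self-contained variant of the second approach, as a sketch: it collects the two-row slices and the separator rows in a list and concatenates once at the end, which avoids rebuilding the frame on every iteration. The sample data is typed in from the question; `pieces` is just an illustrative name.

```python
import pandas as pd

x = pd.DataFrame({"Games": ["CAR 20", "DEN 21", "TB 31", "ATL 24",
                            "SD 27", "KC 33", "CIN 23", "NYJ 22"]})

# Separator row: empty string as the value, -1 as the index label.
empty_line = pd.DataFrame({"Games": [""]}, index=[-1])

pieces = []
for start in range(0, len(x), 2):
    if pieces:                      # no separator before the first pair
        pieces.append(empty_line)
    pieces.append(x.iloc[start:start + 2])

rslt = pd.concat(pieces)
print(rslt)
```

Appending `'\n'` to a cell, as in the question, only changes the string stored in that cell; pandas prints it escaped as `\n` rather than as a blank row, which is why a separate empty row is inserted here instead.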