KeyError raised when trying to delete an existing column - pandas

RangeIndex: 381732 entries, 0 to 381731
Data columns (total 10 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Unnamed: 0 381732 non-null int64
1 tweet_id 378731 non-null float64
2 time 378731 non-null object
3 tweet 378731 non-null object
4 retweet_count 336647 non-null float64
5 Unnamed: 0.1 336647 non-null float64
6 User 3001 non-null object
7 Date_Created 3001 non-null object
8 Source of Tweet 3001 non-null object
9 Tweet 3001 non-null object
dtypes: float64(3), int64(1), object(6)
memory usage: 29.1+ MB
df = df.drop(['Unnamed: 0','Unnamed: 0.1','User','Date_Created','Source of Tweet'],axis =1)
df.head()
I wrote this code to drop unwanted columns from my dataframe, but I am getting a KeyError saying the columns are not found in axis:
KeyError: "['Unnamed: 0', 'Unnamed: 0.1', 'User', 'Date_Created', 'Source of Tweet'] not found in axis"

For debugging purposes, try:
cols_to_drop = ['Unnamed: 0','Unnamed: 0.1','User','Date_Created','Source of Tweet']
df = df[[col for col in df.columns if col not in cols_to_drop]]
and check the remaining columns using df.info().
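One common cause of this error is running the drop cell more than once: the first run removes the columns, so the second run can no longer find them. If the intent is simply to drop the columns whenever they happen to be present, DataFrame.drop also accepts errors='ignore'; a minimal alternative, assuming df is the frame shown above:
cols_to_drop = ['Unnamed: 0', 'Unnamed: 0.1', 'User', 'Date_Created', 'Source of Tweet']
# errors='ignore' skips any label that is not actually present,
# so re-running the cell no longer raises a KeyError.
df = df.drop(columns=cols_to_drop, errors='ignore')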


why is pandas.to_datetime() not changing dtype?

UPDATE
I believed the MRE in my original question to be adequate, but it was not. I have since made a bunch of changes to my code that ultimately resolved the issue, and when I went back to create a workable MRE, I was unable to.
In short, I don't know what exactly I did to produce the original issue.
Originally there was a .concat() call that created the basis for filtered_df, but that doesn't seem to be the culprit.
At this point I consider this resolved, but I'm unsure whether the best course is to delete the question or let it sit, so I'll let it sit for now.
To close this off, below is an unsuccessful attempt to recreate the original issue, included for nothing more than context and background.
import pandas
df1 = pandas.DataFrame({"id": [75979], "symbol": ["USDCAD"], "time": ["2022-11-06 19:11:00-05:00"], "open": [1], "high": [1], "low": [1], "close": [1], "volume": [None]})
df2 = pandas.DataFrame({"id": [75980], "symbol": ["USDCAD"], "time": ["2022-11-06 19:12:00-05:00"], "open": [2], "high": [2], "low": [2], "close": [2], "volume": [None]})
df = pandas.concat([df1, df2])
df.reset_index(drop=True, inplace=True)
df
id symbol time open high low close volume
0 75979 USDCAD 2022-11-06 19:11:00-05:00 1 1 1 1 None
1 75980 USDCAD 2022-11-06 19:12:00-05:00 2 2 2 2 None
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2 entries, 0 to 1
Data columns (total 8 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 id 2 non-null int64
1 symbol 2 non-null object
2 time 2 non-null object
3 open 2 non-null int64
4 high 2 non-null int64
5 low 2 non-null int64
6 close 2 non-null int64
7 volume 0 non-null object
dtypes: int64(5), object(3)
memory usage: 256.0+ bytes
df["time"] = pandas.to_datetime(df["time"])
df
id symbol time open high low close volume
0 75979 USDCAD 2022-11-06 19:11:00-05:00 1 1 1 1 None
1 75980 USDCAD 2022-11-06 19:12:00-05:00 2 2 2 2 None
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2 entries, 0 to 1
Data columns (total 8 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 id 2 non-null int64
1 symbol 2 non-null object
2 time 2 non-null datetime64[ns, pytz.FixedOffset(-300)]
3 open 2 non-null int64
4 high 2 non-null int64
5 low 2 non-null int64
6 close 2 non-null int64
7 volume 0 non-null object
dtypes: datetime64[ns, pytz.FixedOffset(-300)](1), int64(5), object(2)
memory usage: 256.0+ bytes
ORIGINAL
I'm trying to convert the time column in this dataframe to a timezone aware and localized datetime64 like dtype, but the conversion isn't working:
print(f'1:\n{filtered_df.tail(1)}\ndtype: {filtered_df["time"].dtypes}')
filtered_df["time"] = pandas.to_datetime(filtered_df["time"])
print(f'2:\n{filtered_df.tail(1)}\ndtype: {filtered_df["time"].dtypes}')
#
1:
id symbol time open high low close volume
23 75979 USDCAD 2022-11-06 19:11:00-05:00 1.35102 1.35113 1.35102 1.35114 None
dtype: object
2:
id symbol time open high low close volume
23 75979 USDCAD 2022-11-06 19:11:00-05:00 1.35102 1.35113 1.35102 1.35114 None
dtype: object
Done this way, it works:
df = pandas.DataFrame({"id": [75979], "symbol": ["USDCAD"], "time": ["2022-11-06 19:11:00-05:00"], "open": [1.35102], "high": [1.3513], "low": [1.35102], "close": [1.35114], "volume": [None]})
print(f'pre: {df["time"].dtypes}')
df["time"] = pandas.to_datetime(df["time"])
print(f'post: {df["time"].dtypes}')
print(df)
#
pre: object
post: datetime64[ns, pytz.FixedOffset(-300)]
id symbol time open high low close volume
0 75979 USDCAD 2022-11-06 19:11:00-05:00 1.35102 1.3513 1.35102 1.35114 None
I'm having a hard time understanding why one works and one doesn't.
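One pattern that produces a very similar symptom (offered only as a guess, since the original code could not be reproduced) is converting a column through chained indexing: the converted values land on a temporary copy rather than on the frame that is inspected afterwards. A minimal sketch with made-up data:
import pandas

df = pandas.DataFrame({
    "symbol": ["USDCAD", "EURUSD"],
    "time": ["2022-11-06 19:11:00-05:00", "2022-11-06 19:12:00-05:00"],
})

# Chained indexing: the conversion is written to a temporary copy,
# so the original column keeps its object dtype (pandas emits a warning).
df[df["symbol"] == "USDCAD"]["time"] = pandas.to_datetime(df["time"])
print(df["time"].dtype)  # still object

# Working on an explicit copy and assigning the whole column behaves as expected.
filtered_df = df[df["symbol"] == "USDCAD"].copy()
filtered_df["time"] = pandas.to_datetime(filtered_df["time"])
print(filtered_df["time"].dtype)  # tz-aware datetime64, e.g. datetime64[ns, pytz.FixedOffset(-300)]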

Convert and replace a string value in a pandas df with its float type

I have a value in pandas df which is accidentally put as a string as follows:
df.iloc[5329]['values']
'72,5'
I want to convert this value to float and replace it in the df. I have tried the following ways:
df.iloc[5329]['values'] = float(72.5)
also,
df.iloc[5329]['values'] = 72.5
and,
df.iloc[5329]['values'] = df.iloc[5329]['values'].replace(',', '.')
It runs successfully with a warning, but when I check the df, it's still stored as '72,5'.
The entire df at that index is as follows:
df.iloc[5329]
value 36.25
values 72,5
values1 72.5
currency MYR
Receipt Kuching, Malaysia
Delivery Male, Maldives
How can I solve that?
iloc needs a specific row and column position.
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {
        'A': np.random.choice(100, 3),
        'B': [15.2, '72,5', 3.7]
    })
print(df)
df.info()
Output:
A B
0 84 15.2
1 92 72,5
2 56 3.7
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3 entries, 0 to 2
Data columns (total 2 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 A 3 non-null int64
1 B 3 non-null object
Update the value:
df.iloc[1,1] = 72.5
print(df)
Output:
A B
0 84 15.2
1 92 72.5
2 56 3.7
Make sure you don't have chained indexing (i.e. [][]) when doing the assignment, since df.iloc[5329] makes a copy of the data and the further assignment happens on that copy, not on the original df. Instead, index the row and column in a single call:
df.loc[5329, 'values'] = 72.5
(or df.iloc[5329, df.columns.get_loc('values')] = 72.5 if you want purely positional indexing, since iloc does not accept column labels).
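If more than one cell might contain a decimal comma, it may be cleaner to fix the whole 'values' column at once instead of one cell at a time. A minimal sketch, assuming the column otherwise holds numeric strings or numbers:
# Cast to string, swap the decimal comma for a dot, then convert to float.
df['values'] = (
    df['values']
    .astype(str)
    .str.replace(',', '.', regex=False)
    .astype(float)
)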

Pandas - get values on a graph using quantile

I have this df_players:
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 TableIndex 739 non-null object
1 PlayerID 739 non-null int64
2 GameWeek 739 non-null int64
3 Date 739 non-null object
4 Points 739 non-null int64
5 Price 739 non-null float64
6 BPS 739 non-null int64
7 SelectedBy 739 non-null int64
8 NetTransfersIn 739 non-null int64
9 MinutesPlayed 739 non-null float64
10 CleanSheet 739 non-null float64
11 Saves 739 non-null float64
12 PlayersBasicID 739 non-null int64
13 PlayerCode 739 non-null object
14 FirstName 739 non-null object
15 WebName 739 non-null object
16 Team 739 non-null object
17 Position 739 non-null object
18 CommentName 739 non-null object
And I'm using this function, with quantile() (its value passed in via the 'cut' parameter), to plot the distribution of players:
def jointplot(X, Y, week=None, title=None,
              positions=None, height=6,
              xlim=None, ylim=None, cut=0.015,
              color=CB91_Blue, levels=30, bw=0.5, top_rows=100000):
    if positions == None:
        positions = ['GKP', 'DEF', 'MID', 'FWD']
    # Check if week is given as a list
    if week == None:
        week = list(range(max(df_players['GameWeek'])))
    if type(week) != list:
        week = [week]
    df_played = df_players.loc[(df_players['MinutesPlayed'] >= 45)
                               & (df_players['GameWeek'].isin(week))
                               & (df_players['Position'].isin(positions))].head(top_rows)
    if xlim == None:
        xlim = (df_played[X].quantile(cut),
                df_played[X].quantile(1 - cut))
    if ylim == None:
        ylim = (df_played[Y].quantile(cut),
                df_played[Y].quantile(1 - cut))
    sns.jointplot(X, Y, data=df_played,
                  kind="kde", xlim=xlim, ylim=ylim,
                  color=color, n_levels=levels,
                  height=height, bw=bw)
    plt.suptitle(title, fontsize=18)
    plt.show()
call:
jointplot('Price', 'Points', positions=['FWD'],
color=color_list[3], title='Forwards')
This plots the following jointplot (figure omitted), where:
xlim = (4.5, 11.892999999999995)
ylim = (1.0, 13.0)
As I understand it, these x and y limits let me zoom into an area of data points using the quantile range (cut, 1 - cut).
QUESTION
Now I would like to get player 'WebName' for players within a certain area, like so:
After plotting I can choose a target area above and define its range, roughly, by passing xlim and ylim:
jointplot('Price', 'Points', positions=['FWD'],
xlim=(5.5, 7.0), ylim=(11.5, 13.0),
color=color_list[3], title='Forwards')
which zooms in on the area marked in red above.
But how can I get players names inside that area?
You can just select the portion of the players dataframe based on the bounds in the plot:
selected = df_players[
(df_players.Points >= points_lbound)
& (df_players.Points <= points_ubound)
& (df_players.Price >= price_lbound)
& (df_players.Price <= price_ubound)
]
The list of names is then selected.WebName.
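For the zoomed-in region from the question, the bounds are simply the same numbers that were passed as xlim and ylim (the variable names below are only placeholders):
# Price was plotted on the x-axis and Points on the y-axis.
price_lbound, price_ubound = 5.5, 7.0
points_lbound, points_ubound = 11.5, 13.0

selected = df_players[
    (df_players.Points >= points_lbound)
    & (df_players.Points <= points_ubound)
    & (df_players.Price >= price_lbound)
    & (df_players.Price <= price_ubound)
]
print(selected.WebName.tolist())
If the plot was restricted to forwards with at least 45 minutes played, the same Position and MinutesPlayed filters should be applied here as well, so the names match the points that were actually drawn.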

convert csv import via pandas to separate columns

I have a csv file that came into pandas like this:
csv file:
Date,Numbers,Extra, NaN
05/17/2002,15 18 25 33 47,30,
Pandas input:
df = pd.read_csv('/Users/owner/Downloads/file.csv')
#s = Series('05/17/2002', '15 18 25 33 47')
#s.str.partition(' ')
Output
Date Numbers Extra
<bound method NDFrame.head of Draw Date Winning Numbers Extra NaN
05/17/2002 15 18 25 33 47 30 NaN
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1718 entries, 0 to 1717
Data columns (total 4 columns):
Date 1718 non-null object
Numbers 1718 non-null object
Extra 1718 non-null int64
NaN 815 non-null float64
dtypes: float64(1), int64(1), object(2)
memory usage: 53.8+ KB
How do I convert the non-null objects into two columns:
one that is a date
one that is a list
It doesn't seem to recognize split or .str, or the headings.
Thanks
I think you want this. It specifies column 0 as a date column, and a converter for column 1:
>>> df = pd.read_csv('file.csv',parse_dates=[0],converters={1:str.split})
>>> df
Date Numbers Extra NaN
0 2002-05-17 [15, 18, 25, 33, 47] 30
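If the file has already been read without those options, the same result can be reached afterwards; a minimal sketch, assuming the columns are named Date and Numbers as in the simplified example:
df['Date'] = pd.to_datetime(df['Date'])      # string -> datetime64
df['Numbers'] = df['Numbers'].str.split()    # '15 18 25 33 47' -> ['15', '18', '25', '33', '47']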

select rows based on rows in second column

I have two dfs and am looking for a way to select (and count) rows of df1 based on rows in df2.
This is my df1:
Chromosome Start position End position Reference Variant reads \
0 chr1 109419841 109419841 C T 1
1 chr1 197008365 197008365 C T 1
variation reads % variation gDNA nomencl \
0 1 100 Chr1(GRCh37):g.109419841C>T
1 1 100 Chr1(GRCh37):g.197008365C>T
cDNA nomencl ... exon transcript ID inheritance \
0 NM_013296.4:c.-258C>T ... 2 NM_013296.4 Autosomal recessive
1 NM_001994.2:c.*143G>A ... UTR NM_001994.2 Autosomal recessive
test type Phenotype male coverage male ratio covered \
0 Unknown Deafness, autosomal recessief 0 0
1 Unknown Factor 13 deficientie 0 0
female coverage female ratio covered ratio M:F
0 1 1 0.0
1 1 1 0.0
df1 has these columns:
Chromosome 10561 non-null object
Start position 10561 non-null int64
End position 10561 non-null int64
Reference 10415 non-null object
Variant 10536 non-null object
reads 10561 non-null int64
variation reads 10561 non-null int64
% variation 10561 non-null int64
gDNA nomencl 10561 non-null object
cDNA nomencl 10446 non-null object
protein nomencl 9997 non-null object
classification 10561 non-null object
status 10561 non-null object
gene 10560 non-null object
Sanger sequencing list 10561 non-null object
exon 10502 non-null object
transcript ID 10460 non-null object
inheritance 8259 non-null object
test type 10561 non-null object
Phenotype 10380 non-null object
male coverage 10561 non-null int64
male ratio covered 10561 non-null int64
female coverage 10561 non-null int64
female ratio covered 10561 non-null int64
and this is df2:
Chromosome Startposition Endposition Bases Meancoverage \
0 chr1 11073785 11074022 27831.0 117.927966
1 chr1 11076901 11077064 11803.0 72.411043
Mediancoverage Ratiocovered>10X Ratiocovered>20X Genename Componentnr \
0 97.0 1.0 1.0 TARDBP 1
1 76.0 1.0 1.0 TARDBP 2
PositionGenes PositionGenome Position
0 TARDBP.1 chr1.11073785-11074022 comp.1_chr1.11073785-11074022
1 TARDBP.2 chr1.11076901-11077064 comp.2_chr1.11076901-11077064
I want to select all rows from df1 for which there is a row in df2 with:
the same value for 'Chromosome'
df1['Start position'] >= df2.Startposition
df1['End position'] <= df2.Endposition
If all three criteria are met by the same row of df2, I want to select the corresponding row in df1.
I already fused the three columns 'Chromosome', 'Startposition' and 'Endposition' into 'PositionGenome' to build a lambda function, but I couldn't come up with anything.
I hope you can help me.
A short update: in the end I solved the problem with Unix bedtools -wb. Still, I would be glad if someone could come up with a Python-based solution.
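For completeness, a Python-based sketch of the same interval lookup, assuming the column names shown above. It pairs rows on Chromosome with an inner merge and then filters on the interval bounds, which can be memory-hungry for very large frames:
import pandas as pd

# Pair every df1 row with every df2 interval on the same chromosome.
merged = df1.merge(
    df2[['Chromosome', 'Startposition', 'Endposition']],
    on='Chromosome',
    how='inner',
)

# Keep df1 rows whose position range falls inside a df2 interval.
mask = (merged['Start position'] >= merged['Startposition']) & \
       (merged['End position'] <= merged['Endposition'])

# Drop the helper columns and de-duplicate, since one df1 row can match several intervals.
selected = (merged[mask]
            .drop(columns=['Startposition', 'Endposition'])
            .drop_duplicates())
print(len(selected))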