Clean output of a pandas data extraction, deleting the unnamed index column - pandas

I have a dataset from which I extract a row based on a condition on the 'Description' column. These are the first few rows, to show what the data looks like.
I extracted the row with the condition below:
ATL_ID=airport_codes[airport_codes['Description'].str.contains('Hartsfield-Jackson Atlanta ')]
It successfully finds the row. Now I need to extract the value under 'Code'. I use this code:
ATL_ID.loc[:,'Code']
and output is:
373 10397
Name: Code, dtype: int64
I don't want anything in the output except 10397. 373 is the row index, and the rest is additional description that I don't need. How can I get just the single number for 'Code'?
Thanks
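One possible way to get the bare number is to select the 'Code' column of the filtered frame and take a single element from it. The sketch below assumes a small made-up airport_codes table (the descriptions and the other codes and index labels are invented for illustration; only 10397 and the index 373 come from the question).

```python
import pandas as pd

# Hypothetical stand-in for the airport_codes table from the question.
airport_codes = pd.DataFrame(
    {
        "Code": [10001, 10397, 10002],
        "Description": [
            "Some other airport",
            "Atlanta, GA: Hartsfield-Jackson Atlanta International",
            "Yet another airport",
        ],
    },
    index=[100, 373, 700],
)

ATL_ID = airport_codes[
    airport_codes["Description"].str.contains("Hartsfield-Jackson Atlanta ")
]

# .iloc[0] takes the first matching value by position, ignoring the index label.
atl_code = ATL_ID["Code"].iloc[0]

# .item() also returns a scalar, but raises ValueError unless exactly one row matched.
atl_code_strict = ATL_ID["Code"].item()

print(atl_code)
```

.iloc[0] is the more forgiving choice when several rows may match; .item() is stricter and can help catch an ambiguous filter early.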

Related

How to read csv files correctly using pandas?

I have a CSV file like the one below. I need to check whether any row has more fields than there are columns in the header. For example:
name,age,profession
"a","24","teacher","cake"
"b",31,"Doctor",""
"c",27,"Engineer","tea"
If I try to read it using
print(pd.read_csv('test.csv'))
it prints the following:
name age profession
a 24 teacher cake
b 31 Doctor NaN
c 27 Engineer tea
But this is wrong. It happens because the header has fewer columns than the data rows. So I need to detect this scenario as an invalid CSV format. What is the best way to test for this, other than reading the file as a string and checking the length of each row?
One important point: the columns can differ from file to file. There are no mandatory columns.
You can try passing header=None to read_csv. pandas then takes the expected number of fields from the first line and raises ParserError if a later row has more fields than that. For example:
import pandas as pd
try:
    df = pd.read_csv("your_file.csv", header=None)
except pd.errors.ParserError:
    print("File Invalid")
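To try the check without a file on disk, the same idea can be wrapped in a small helper that reads from a string. This is only a sketch: has_consistent_columns is a hypothetical name, and it only flags rows that are longer than the first line, since pandas pads shorter rows with NaN instead of raising.

```python
import io

import pandas as pd


def has_consistent_columns(csv_text: str) -> bool:
    """Return False if any row has more fields than the first line."""
    try:
        # header=None makes the first line an ordinary row, so its field
        # count becomes the expected width for the rest of the file.
        pd.read_csv(io.StringIO(csv_text), header=None)
    except pd.errors.ParserError:
        return False
    return True


# The malformed sample from the question: 3 header fields, 4 data fields.
bad_csv = (
    "name,age,profession\n"
    '"a","24","teacher","cake"\n'
    '"b",31,"Doctor",""\n'
    '"c",27,"Engineer","tea"\n'
)

# A well-formed variant with 3 fields on every line.
good_csv = (
    "name,age,profession\n"
    '"a","24","teacher"\n'
    '"b",31,"Doctor"\n'
)

bad_ok = has_consistent_columns(bad_csv)
good_ok = has_consistent_columns(good_csv)
print(bad_ok, good_ok)
```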

I want to know how to get the row with 2 specific values from two different columns

I have a df with more than 13000 rows and more than 154 columns. One column, 'caseid', contains the value 2298, and I want to print the value of another column, 'prglngth', for that row.
My steps were: first, find the index of the row where 'caseid' is 2298.
Second, use that index to look up the value in the 'prglngth' column. I have already lost 48 hours trying this. Any help will be appreciated!
Try to use:
df.loc[df['caseid'] == 2298, 'prglngth']
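As a small illustration of that selection, the sketch below uses a made-up three-row frame (the values and the extra column are assumptions, not the asker's data). df.loc with a Boolean mask and a column label returns a Series, so .iloc[0] is used to pull out the single value.

```python
import pandas as pd

# Hypothetical miniature of the real frame; only the column names
# 'caseid' and 'prglngth' come from the question.
df = pd.DataFrame(
    {
        "caseid": [2297, 2298, 2299],
        "prglngth": [39, 40, 38],
        "other": ["x", "y", "z"],
    }
)

# Boolean mask on the rows, column label on the columns: returns a Series.
match = df.loc[df["caseid"] == 2298, "prglngth"]

# The bare value of 'prglngth' for the first matching row.
value = match.iloc[0]

# Omitting the column label returns the whole matching row(s).
row = df.loc[df["caseid"] == 2298]

print(value)
```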

Pandas dataframe selection df['a'][50][:51]

I have a dataframe in which one of the column names is 'a'.
I came across the following selection expression:
dataframe['a'][50][:50]
I understand dataframe['a'][50] selects the row 49 in column ['a'], but what does [:50] do?
Thank you
dataframe['a'][50] selects the value at index label 50 of column 'a' (with the default integer index, that is the 51st row). If dataframe['a'][50][:50] doesn't raise an error and actually returns something, that value must be a sequence type such as a list, string, or tuple.
dataframe['a'][50][:50] then returns elements 0 to 49 of that sequence.
If the value is not a sequence type, you will get an error. Check dataframe['a'][50] on its own to see what type it is.
Note: dataframe['a'][50] is chained indexing, which is not recommended. That is outside the scope of this question, so I won't go into detail here.
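A small sketch can make the two steps visible. It assumes a made-up frame in which every cell of column 'a' holds a list of 60 integers; the single .loc call is shown as one alternative that avoids the chained lookup.

```python
import pandas as pd

# Hypothetical data: each cell of column 'a' is a list of 60 integers.
dataframe = pd.DataFrame({"a": [list(range(i, i + 60)) for i in range(60)]})

# Step 1: the cell at index label 50 of column 'a' (a plain Python list).
cell = dataframe["a"][50]

# Step 2: an ordinary Python slice of that list, elements 0 through 49.
first_fifty = dataframe["a"][50][:50]

# Same result with a single .loc lookup instead of chained indexing.
first_fifty_loc = dataframe.loc[50, "a"][:50]

print(type(cell).__name__, len(first_fifty))
```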

Need explanation on how pandas.drop is working here

I have a data frame, let's say xyz. I have written code to find the percentage of null values in each column of the dataframe. My code is below:
round(100*(xyz.isnull().sum()/len(xyz.index)), 2)
Let's say I got the following results:
abc 26.63
def 36.58
ghi 78.46
I want to drop column ghi because it has more than 70% null values.
I achieved it using the following code:
xyz = xyz.drop(xyz.loc[:,round(100*(xyz.isnull().sum()/len(xyz.index)), 2)>70].columns, 1)
But I do not understand how this code works. Can anyone please explain it?
The code does the following:
xyz.drop([...], 1)
removes the specified labels along a given axis, either rows or columns. Here the second argument is the axis, so xyz.drop(..., 1) drops along axis 1, i.e. columns. Recent pandas versions accept the axis only as a keyword, so there it has to be written xyz.drop(..., axis=1).
xyz.loc[:, ...].columns
returns an Index holding the names of the columns selected by the condition.
round(100*(xyz.isnull().sum()/len(xyz.index)), 2)>70
counts the nulls in each column and divides by the number of rows, which gives the percentage of NaN per column. The result is rounded to 2 decimal places and compared with 70, which yields True for every column with more than 70% NaN. You end up with a Boolean Series indexed by column name.
Putting everything together: you first produce a Boolean Series that marks which columns have more than 70% NaN. Then .loc uses Boolean indexing to select only the columns you want to drop (NaN % > 70%), .columns recovers the names of those columns, and .drop removes them.
Hopefully this clears things up!
If your code is hard to understand, you can use dropna with thresh instead, since pandas already covers this case. thresh is the minimum number of non-null values a column needs in order to be kept:
df=df.dropna(axis=1,thresh=round(len(df)*0.3))
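To compare the two approaches side by side, here is a sketch on a made-up 10-row frame whose columns are 20%, 40% and 80% null (these numbers are assumptions chosen for easy arithmetic, not the percentages from the question). The intermediate names null_pct, too_sparse and cols_to_drop are hypothetical and only split the one-liner into its steps.

```python
import numpy as np
import pandas as pd

# Hypothetical data: 'abc' is 20% null, 'def' 40% null, 'ghi' 80% null.
xyz = pd.DataFrame(
    {
        "abc": [1.0] * 8 + [np.nan] * 2,
        "def": [1.0] * 6 + [np.nan] * 4,
        "ghi": [1.0] * 2 + [np.nan] * 8,
    }
)

# Step 1: percentage of nulls per column, as in the question.
null_pct = round(100 * (xyz.isnull().sum() / len(xyz.index)), 2)

# Step 2: Boolean Series, True for columns with more than 70% nulls.
too_sparse = null_pct > 70

# Step 3: names of the columns selected by that mask.
cols_to_drop = xyz.loc[:, too_sparse].columns

# Step 4: drop them, using the keyword form of the axis argument.
dropped = xyz.drop(cols_to_drop, axis=1)

# Alternative: keep only columns with at least 30% non-null values.
via_dropna = xyz.dropna(axis=1, thresh=round(len(xyz) * 0.3))

print(list(dropped.columns), list(via_dropna.columns))
```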

How to access columns by their names and not by their positions?

I have just tried my first sqlite select statement and got a result (an iterator over tuples). In other words, every row is represented by a tuple, and I can access the values in the cells of a row like this: r[7] or r[3] (the value from column 7 or column 3). But I would like to access columns by their names rather than by their positions. Let us say I would like to know the value in the column user_name. How can I do that?
I found the answer to my question here:
cursor.execute("PRAGMA table_info(tablename)")
print(cursor.fetchall())
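The PRAGMA query lists the column names of a table. If the goal is to read a value by column name directly from a result row, Python's sqlite3 module also offers sqlite3.Row as a row factory; this is a different technique from the PRAGMA above. The sketch below assumes an in-memory database with a made-up users table; only the column name user_name comes from the question.

```python
import sqlite3

# In-memory database with a hypothetical 'users' table for illustration.
conn = sqlite3.connect(":memory:")

# sqlite3.Row makes each result row addressable by column name as well
# as by position.
conn.row_factory = sqlite3.Row

conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, user_name TEXT)")
conn.execute("INSERT INTO users (user_name) VALUES (?)", ("alice",))

row = conn.execute("SELECT id, user_name FROM users").fetchone()

name_by_key = row["user_name"]  # access by column name
name_by_pos = row[1]            # positional access still works
columns = row.keys()            # column names of this result row

# The PRAGMA from the answer: field 1 of each info row is the column name.
pragma_names = [info[1] for info in conn.execute("PRAGMA table_info(users)")]

print(name_by_key, columns, pragma_names)
```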