Filtering out columns with Pandas - pandas

I want to filter a pandas DataFrame to keep only the columns whose names contain a certain wildcard, plus the two columns directly to the right of each match.
The DataFrame tracks pupil grades, overall totals and feedback. I only want to keep the data that corresponds to Homework and not other assessments. So in the example below I want to keep First Name, Last Name, every Homework column, and the corresponding Points and Feedback columns, which are always exported directly to the right of it.
First Name,Last Name,Understanding Business Homework,Points,Feedback,Past Paper Homework,Points,Feedback, Groupings/Structures Questions,Points, Feedback
import pandas as pd
import numpy as np
all_data = all_data.filter(like=('Homework') and ('First Name') and
('Second Name') and ('Points'),axis=1)
print(all_data.head())
export_csv = all_data.to_csv (r'C:\Users\Sandy\Python\Automate_the_Boring_Stuff\new.csv', index = None, header=True)
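A note on the attempt above: in Python, `'Homework' and 'First Name' and 'Second Name' and 'Points'` evaluates to the last string, so the call is equivalent to `filter(like='Points', axis=1)`. Also, `pd.read_csv` renames repeated headers (`Points`, `Points.1`, ...), so selecting by column position is usually safer here than selecting by name. The following is a minimal sketch under those assumptions; the sample row is made up, and `keep` and `homework_only` are illustrative names.

```python
import pandas as pd

# Header row taken from the question; the data row is invented for illustration.
columns = ["First Name", "Last Name",
           "Understanding Business Homework", "Points", "Feedback",
           "Past Paper Homework", "Points", "Feedback",
           "Groupings/Structures Questions", "Points", "Feedback"]
rows = [["Ann", "Lee", "done", 8, "Good", "done", 7, "OK", "done", 5, "Fine"]]
all_data = pd.DataFrame(rows, columns=columns)

# Work with stripped names so stray spaces in the export do not matter.
names = [str(c).strip() for c in all_data.columns]

# Positions of the identifying columns.
keep = [i for i, n in enumerate(names) if n in ("First Name", "Last Name")]

# For every Homework column, keep it plus the two columns to its right.
for i, n in enumerate(names):
    if "Homework" in n:
        keep.extend(j for j in (i, i + 1, i + 2) if j < len(names))

# Select by position, which also works when column names repeat.
homework_only = all_data.iloc[:, keep]
print(homework_only.columns.tolist())
```

The result can then be written out with `homework_only.to_csv(path, index=False)`.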

Related

How to deal with discrepancy in csv and pandas Df?

When I load the dataset in pandas it shows 1,00,000 rows, but when I open it in Excel it shows 3,00,000 rows. Is there any Python code that could help me deal with this kind of discrepancy?
import pandas as pd
df = pd.read_csv('C_data_2.csv')
# Get the counts of each value in the Gender column
counts = df['Gender'].value_counts()
# Find the most common value in the Gender column
most_common = counts.index[0]
# Impute missing values in the Gender column with the most common value
df['Gender'] = df['Gender'].fillna(most_common)
# Replace the literal string "nan" with most_common; assign the result back
# rather than calling replace(..., inplace=True) on the selected column
df['Gender'] = df['Gender'].replace('nan', most_common)
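One way to investigate a row-count mismatch like this is to count the rows three ways and see where the numbers diverge: physical lines in the file, records as parsed by the `csv` module, and rows in the pandas DataFrame. The sketch below is only a diagnostic under assumptions: `row_counts` is a hypothetical helper, the sample file is made up, and whether quoted line breaks are what affects the asker's file is not known.

```python
import csv
import os
import tempfile

import pandas as pd


def row_counts(path):
    """Return (physical lines, parsed CSV records, pandas rows) for one file."""
    with open(path, newline="", encoding="utf-8") as fh:
        physical_lines = sum(1 for _ in fh)
    with open(path, newline="", encoding="utf-8") as fh:
        csv_records = sum(1 for _ in csv.reader(fh))
    pandas_rows = len(pd.read_csv(path))  # the header row is not counted
    return physical_lines, csv_records, pandas_rows


# Tiny made-up file: the first record has a line break inside a quoted field.
sample = 'Name,Note\nAnn,"line one\nline two"\nBob,ok\n'
with tempfile.NamedTemporaryFile("w", suffix=".csv", delete=False,
                                 newline="", encoding="utf-8") as tmp:
    tmp.write(sample)

counts = row_counts(tmp.name)
os.remove(tmp.name)
print(counts)
```

If the three numbers differ for the real file, the gap between them narrows down where rows are being merged or split.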

Pandas - finding most important words from each row

I have a pandas dataframe with a text column. I am trying to find the most important words from this text column for each row. How do I do this?
I am currently trying to do this using tf-idf:
from sklearn.feature_extraction.text import TfidfVectorizer
v = TfidfVectorizer(stop_words='english')
x = v.fit_transform(df['cleansed_text'])
I see that x is a sparse matrix with the same number of rows as my dataframe, and it looks like the number of columns equals the number of words in the vocabulary.
How do I use this to find the most important words for each row?

How to plot letter grades like A* in matplotlib

I'm a complete beginner and I have a college stats project: I'm comparing exam scores for our year group and the one below. I collected my own data and, since I do CS, I decided to try visualizing it with pandas and matplotlib (my first time). I was able to read the CSV file into a dataframe with the columns Level, Grade, Difficulty, Revision, Happy and MAG. Level is just the year group, e.g. AS or A2, and MAG is a minimum expected grade; the rest are numeric values out of 5.
I want to do some plotting but I can't seem to get it to work.
I want to plot Revision against Difficulty for the AS group and try to show a correlation. I also want to show a bar chart (if appropriate) for Grade vs MAG.
Here is the CSV: https://docs.google.com/spreadsheets/d/169UKfcet1qh8ld-eI7B4U14HIl7pvgZfQLE45NrleX8/edit?usp=sharing
This is the code so far:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
df = pd.read_csv('Report Task.csv')
df.columns = ['Level','Grade','Difficulty','Revision','Happy','MAG'] #numerical values are out of 5
df[df.Level.str.match('AS')] #to get only AS group
plt.plot(df.Revision, df.Difficulty)
This is my first time ever posting on Stack, so I'm really sorry if I did something wrong.
For difficulty vs revision, you were using a line plot. You're probably looking for a scatter plot:
df = df[df.Level.str.match('AS')] # note the extra `df =` as per comments
plt.scatter(x=df.Revision, y=df.Difficulty)
plt.xlabel('Revision')
plt.ylabel('Difficulty')
Alternatively you can plot via pandas directly:
df = df[df.Level.str.match('AS')] # note the extra `df =` as per comments
df.plot.scatter(x='Revision', y='Difficulty')
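For the letter-grade part of the question, one option is a bar chart of how many pupils have each grade, with the bars forced into grade order so that A* comes first. This is only a sketch: it assumes the MAG column holds letter grades, and both the sample values and the `grade_order` scale are made up.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd

# Made-up MAG values; the real ones would come from the CSV.
df = pd.DataFrame({"MAG": ["A", "B", "A*", "C", "A", "B", "A*", "A"]})

# Assumed grading scale, best grade first.
grade_order = ["A*", "A", "B", "C", "D", "E", "U"]

# Count pupils per grade, then reorder to the grading scale (missing grades become 0).
counts = df["MAG"].value_counts().reindex(grade_order, fill_value=0)

ax = counts.plot.bar(rot=0)
ax.set_xlabel("MAG")
ax.set_ylabel("Number of pupils")
plt.tight_layout()
```

In an interactive session, drop the `matplotlib.use("Agg")` line and call `plt.show()` to display the chart.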

How to delete two or more columns in csv using pandas

I have a CSV file with 4 columns (Name, User_Name, Phone#, Email). I want to delete the rows that have no value in both Phone# and Email. If there is no value in Phone# but there is a value in Email, or vice versa, I don't want to delete that row. I hope you will understand what I want.
Sorry I don't have the code.
Thanks in advance
You can use the pandas notna() function to get a boolean series indicating which values are not missing. You can call this on both the email and the phone column and combine it with boolean | to get a truth series indicating that at least one of the email and phone columns is not missing. Then, you can use this series as a mask to filter the right rows.
import pandas as pd
# Import .csv file
df = pd.read_csv('mypath/myfile.csv')
# Filter to get rows where columns 'Email' and 'Phone#' are not both None
new_df = df[df['Email'].notna() | df['Phone#'].notna()]
# Write pandas df to disk
new_df.to_csv('mypath/mynewfile.csv', index=False)
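As an alternative sketch, `DataFrame.dropna` with `subset=` and `how='all'` drops only the rows where every listed column is missing, which should give the same result as the mask above. The sample rows here are made up so the example runs without a file.

```python
import pandas as pd

# Made-up data with the four columns from the question.
df = pd.DataFrame({
    "Name": ["Ann", "Bob", "Cy", "Di"],
    "User_Name": ["ann1", "bob2", "cy3", "di4"],
    "Phone#": ["555-0100", None, None, "555-0103"],
    "Email": ["ann@example.com", "bob@example.com", None, None],
})

# Drop a row only when BOTH Phone# and Email are missing.
new_df = df.dropna(subset=["Phone#", "Email"], how="all")

# Same filter written as a boolean mask, for comparison.
mask_df = df[df["Email"].notna() | df["Phone#"].notna()]

print(new_df)
```

Only the row for Cy, which has neither a phone number nor an email, is removed.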

Count frequency of multiple words

I used this code
unclassified_df['COUNT'] = unclassified_df.tweet.str.count('mulcair')
to count the number of times mulcair appeared in each row in my pandas dataframe. I am trying to repeat the same but for a set of words such as
Liberal = ['lpc','ptlib','justin','trudeau','realchange','liberal', 'liberals', "liberal2015",'lib2015','justin2015', 'trudeau2015', 'lpc2015']
I saw somewhere that I could use collections.Counter(data) and its .most_common(k) method for this. Can anyone please help me out?
import pandas as pd
# Words to look for in each tweet
Liberal = ['lpc','justin','trudeau','realchange','liberal', 'liberals', "liberal2015", 'lib2015','justin2015', 'trudeau2015', 'lpc2015']
# Sample data
data = {'tweet': ['lpc living dream camerama', "jsutingnasndsa dnsadnsadnsa dsalpcdnsa", "but", 'mulcair suggests thereslcp bad lpc blood']}
# DataFrame with a single column, tweet
df = pd.DataFrame(data, columns=['tweet'])
# Number of rows that contain each word (a row counts once, however often the word repeats in it)
print([(df.tweet.str.contains(word).sum(), word) for word in Liberal])
# Total number of occurrences of each word across all rows (repeats within a row included)
print(pd.Series({w: df.tweet.str.count(w).sum() for w in Liberal}))
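To get a per-row COUNT column for the whole word list, as in the original single-word line, one option is to join the words into a single regular expression and pass it to `str.count`. This is a sketch under assumptions: the `pattern` name is illustrative, and the regex counts non-overlapping matches, so "liberals" counts once rather than once for "liberal" and again for "liberals".

```python
import re

import pandas as pd

Liberal = ['lpc', 'justin', 'trudeau', 'realchange', 'liberal', 'liberals',
           'liberal2015', 'lib2015', 'justin2015', 'trudeau2015', 'lpc2015']

# Same sample tweets as above.
df = pd.DataFrame({'tweet': ['lpc living dream camerama',
                             'jsutingnasndsa dnsadnsadnsa dsalpcdnsa',
                             'but',
                             'mulcair suggests thereslcp bad lpc blood']})

# Longest words first so 'liberal2015' is tried before 'liberal'.
words = sorted(Liberal, key=len, reverse=True)
pattern = '|'.join(re.escape(w) for w in words)

# Number of matches of any listed word in each row.
df['COUNT'] = df.tweet.str.count(pattern)
print(df)
```

Like the code above, this matches substrings inside longer tokens; wrapping the pattern as `r'\b(?:' + pattern + r')\b'` would restrict it to whole words.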