How to deal with a row-count discrepancy between a CSV file and a pandas DataFrame? - pandas

When I load the dataset in pandas it shows 1,00,000 rows, but when I open it in Excel it shows 3,00,000 rows. Is there any Python code that could help me deal with this kind of discrepancy?
import pandas as pd
df=pd.read_csv('C_data_2.csv')
# Get the counts of each value in the gender column
counts = df['Gender'].value_counts()
# Find the most common value in the gender column
most_common = counts.index[0]
# Impute missing values in the gender column with the most common value
df['Gender'] = df['Gender'].fillna(most_common)
# Replace literal "nan" strings in the Gender column with the most common value
df["Gender"] = df["Gender"].replace("nan", most_common)
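To see where the row counts start to differ, it can help to count the same file in several ways. The following is only a rough sketch: the helper name compare_row_counts and the sample file are made up for illustration, and it does not claim to reproduce how Excel counts rows. It relies on two known behaviours: pandas skips blank lines by default, and a quoted field that contains a line break spans several physical lines but is a single record.

```python
import csv
import os
import tempfile

import pandas as pd


def compare_row_counts(path):
    """Count the rows of one CSV file in three different ways (hypothetical helper)."""
    # Physical lines in the file, including the header and blank lines.
    with open(path, newline="") as handle:
        physical_lines = sum(1 for _ in handle)
    # Records as the csv module parses them: a quoted multi-line field is one
    # record, and a blank line is returned as an empty record.
    with open(path, newline="") as handle:
        csv_records = sum(1 for _ in csv.reader(handle))
    # pandas treats the first row as the header and skips blank lines by default.
    pandas_rows = len(pd.read_csv(path))
    return {
        "physical_lines": physical_lines,
        "csv_records": csv_records,
        "pandas_rows": pandas_rows,
    }


# Small made-up file: one quoted field with a line break and one blank line.
sample = 'Name,Note\nAnn,"line one\nline two"\n\nBob,ok\n'
with tempfile.TemporaryDirectory() as folder:
    sample_path = os.path.join(folder, "sample.csv")
    with open(sample_path, "w", newline="") as handle:
        handle.write(sample)
    counts = compare_row_counts(sample_path)
print(counts)
```

If the three numbers differ for the real file, the gap points at blank lines or multi-line fields rather than at missing data.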

Related

Extracting portions of the entries of Pandas dataframe

I have a Pandas dataframe with several columns in which each entry is a combination of numbers, upper- and lower-case letters, and some special characters, i.e. "=A-Za-z0-9_|". Each entry of the column is of the form:
'x=ABCDefgh_5|123|'
I want to retain only the digits 0-9 that appear between | | and strip out all other characters. Here is my code for one column of the dataframe:
list(map(lambda x: x.lstrip(r'\[=A-Za-z_|,]+'), df[1]))
However, the code returns the full entry 'x=ABCDefgh_5|123|' without stripping out anything. Is there an error in my code?
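str.lstrip() does not take a regular expression: its argument is treated as a set of individual characters to remove from the left, and since 'x' is not one of the characters listed, nothing is stripped. Instead of working with hard-to-read regular expressions, you might want to consider a simple split. For example: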
import pandas as pd
d = {'col': ["x=ABCDefgh_5|123|", "x=ABCDefgh_5|123|"]}
df = pd.DataFrame(data=d)
output = df["col"].str.split("|").str[1]
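If a regular expression is still preferred, a minimal sketch with Series.str.extract is shown here. It assumes every entry contains exactly one run of digits between two | characters, as in the example entry; the variable names are illustrative only. The first line also demonstrates why the original lstrip call left the string unchanged.

```python
import pandas as pd

entry = "x=ABCDefgh_5|123|"

# lstrip treats its argument as a set of characters, not as a pattern.
# 'x' is not in that set, so nothing is removed.
unchanged = entry.lstrip(r'\[=A-Za-z_|,]+')

df = pd.DataFrame({"col": [entry, entry]})

# Capture the digits that sit between two pipe characters.
digits = df["col"].str.extract(r"\|(\d+)\|", expand=False)
print(digits.tolist())
```

The extracted values are still strings; call astype(int) on the result if numbers are needed.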

Converting list of nested dicts to Dataframe

I am trying to convert a list of dicts with the following format to a single Dataframe where each row contains a specific type of betting odds offered by one sports book (meaning ‘h2h’ odds and ‘spread’ odds are in separate rows):
temp = [{"id":"e4cb60c1cd96813bbf67450007cb2a10",
"sport_key":"americanfootball",
"sport_title":"NFL",
"commence_time":"2022-11-15T01:15:31Z",
"home_team":"Philadelphia Eagles",
"away_team":"Washington Commanders",
"bookmakers":
[{"key":"fanduel","title":"FanDuel",
"last_update":"2022-11-15T04:00:35Z",
"markets":[{"key":"h2h","outcomes":[{"name":"Philadelphia Eagles","price":630},
{"name":"Washington Commanders","price":-1200}]}]},
{"key":"draftkings","title":"DraftKings",
"last_update":"2022-11-15T04:00:30Z",
"markets":[{"key":"h2h","outcomes":[{"name":"Philadelphia Eagles","price":600},
{"name":"Washington Commanders","price":-950}]}]},
There are many more bookmaker entries of the same format. I have tried:
df = pd.DataFrame(temp)
# normalize the column of dicts
normalized = pd.json_normalize(df['bookmakers'])
# join the normalized column to df
df = df.join(normalized,).drop(columns=['bookmakers'])
# join the normalized column to df
df = df.join(normalized, lsuffix = 'key')
However, this results in a Dataframe with repeated columns and columns that contain dictionaries.
Thanks for any help in advance!
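One possible approach is to walk the nested lists explicitly and build one flat dict per bookmaker and market, then pass the list of dicts to pd.DataFrame. This is only a sketch under assumptions: the sample below is a shortened copy of the data with the closing brackets guessed (the original listing is truncated), and the column names bookmaker, market, home_price and away_price are invented for illustration.

```python
import pandas as pd

# Shortened sample in the same shape as the question's data. The closing
# brackets are an assumption, since the original listing is cut off.
temp = [{
    "id": "e4cb60c1cd96813bbf67450007cb2a10",
    "home_team": "Philadelphia Eagles",
    "away_team": "Washington Commanders",
    "bookmakers": [
        {"key": "fanduel", "title": "FanDuel",
         "markets": [{"key": "h2h", "outcomes": [
             {"name": "Philadelphia Eagles", "price": 630},
             {"name": "Washington Commanders", "price": -1200}]}]},
        {"key": "draftkings", "title": "DraftKings",
         "markets": [{"key": "h2h", "outcomes": [
             {"name": "Philadelphia Eagles", "price": 600},
             {"name": "Washington Commanders", "price": -950}]}]},
    ],
}]

rows = []
for event in temp:
    for book in event["bookmakers"]:
        for market in book["markets"]:
            # Map each outcome's team name to its price for this market.
            prices = {o["name"]: o["price"] for o in market["outcomes"]}
            rows.append({
                "id": event["id"],
                "home_team": event["home_team"],
                "away_team": event["away_team"],
                "bookmaker": book["key"],
                "market": market["key"],  # e.g. 'h2h' or 'spread'
                "home_price": prices.get(event["home_team"]),
                "away_price": prices.get(event["away_team"]),
            })

# One row per (event, bookmaker, market), with no dict-valued columns.
odds = pd.DataFrame(rows)
print(odds)
```

Because each market becomes its own row, 'h2h' and 'spread' odds from the same bookmaker end up on separate rows without any repeated columns.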

Pandas remove all columns before a column with a matching value is found

I have multiple csv files that I want to import. For each one I want to remove all columns before the first column that contains a date, then concatenate the files so that the date is in the first column of every dataframe. All files have a date column, but its column index may differ from file to file. I also insert the filename as a column.
As of now I'm removing all rows with no dates, but I have no idea how to find the index of the matching column so that I can remove all the columns before it.
import os
import glob
import pandas as pd
globbed_files = glob.glob("*.csv")
data = []
for csv in globbed_files:
    frame = pd.read_csv(csv)
    # Record the source filename (without extension) in a new column
    frame['g'] = os.path.basename(csv).split(".")[0]
    data.append(frame)
bigframe = pd.concat(data, ignore_index=True)  # don't let pandas try to align row indexes
bigframe.to_csv("processed.csv")
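One way to locate the date column is to try parsing each non-numeric column with pd.to_datetime and keep everything from the first column that parses cleanly. The sketch below is not a definitive solution: the helper name drop_columns_before_date is hypothetical, and it assumes the date column is stored as text in which every non-missing value is a parseable date. pandas may emit a format-inference warning for columns that are not dates.

```python
import pandas as pd


def drop_columns_before_date(frame):
    """Return frame without the columns left of the first date-like column."""
    for position, name in enumerate(frame.columns):
        column = frame[name]
        # Numeric columns would be read as epoch offsets, so skip them.
        if pd.api.types.is_numeric_dtype(column):
            continue
        non_null = column.dropna()
        if non_null.empty:
            continue
        # Values that cannot be parsed become NaT instead of raising.
        parsed = pd.to_datetime(non_null, errors="coerce")
        if parsed.notna().all():
            return frame.iloc[:, position:]
    # No date-like column found: leave the frame unchanged.
    return frame


# Made-up frame with two leading columns that should be dropped.
example = pd.DataFrame({
    "junk": ["widget", "gadget"],
    "n": [1, 2],
    "when": ["2022-01-01", "2022-01-02"],
    "value": [3.5, 4.5],
})
trimmed = drop_columns_before_date(example)
print(list(trimmed.columns))
```

Inside the loop over the files, the helper would be applied to each frame right after pd.read_csv. Since the date columns may have different names in different files, they would also need a common name before pd.concat can line them up.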

How to delete two or more columns in csv using pandas

I have a csv file with 4 columns (Name, User_Name, Phone#, Email). I want to delete the rows that have no value in both Phone# and Email. If there is no value in Phone# but some value in Email, or vice versa, I don't want to delete that row. I hope you get what I want.
Sorry I don't have the code.
Thanks in advance
You can use the pandas notna() function to get a boolean series indicating which values are not missing. You can call this on both the email and the phone column and combine it with boolean | to get a truth series indicating that at least one of the email and phone columns is not missing. Then, you can use this series as a mask to filter the right rows.
import pandas as pd
# Import .csv file
df = pd.read_csv('mypath/myfile.csv')
# Filter to get rows where columns 'Email' and 'Phone#' are not both None
new_df = df[df['Email'].notna() | df['Phone#'].notna()]
# Write pandas df to disk
new_df.to_csv('mypath/mynewfile.csv', index=False)
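An equivalent way to express the same filter is DataFrame.dropna with subset and how="all", which drops a row only when every listed column is missing. The small frame below is made-up sample data, used here in place of the CSV file.

```python
import pandas as pd

# Made-up sample data standing in for the CSV contents.
df = pd.DataFrame({
    "Name": ["Ann", "Bob", "Cid"],
    "User_Name": ["ann1", "bob2", "cid3"],
    "Phone#": ["555-0100", None, None],
    "Email": [None, "bob@example.com", None],
})

# Drop a row only if BOTH Phone# and Email are missing.
new_df = df.dropna(subset=["Phone#", "Email"], how="all")

# Same result as the boolean-mask version.
masked = df[df["Email"].notna() | df["Phone#"].notna()]
print(new_df)
```

Both versions keep Ann and Bob and drop Cid, whose phone and email are both missing.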

SQL - Count All Cells In The Entire Table That Are Not NULL And Not Empty

I have recently been asked to do a count of all the cells in some tables that are not NULL and not empty/blank.
The issue is, I have about 80 tables and some of those tables have dozens of columns and others have hundreds of columns.
Is there a query I could use to count all cells from all columns that fit a specific criteria (in this case not NULL and not empty/blank)?
I have done some searching and it seems most answers revolve around single columns or tables that only have like 3-5 columns.
Thanks!
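Try connecting to the database from pandas with a connector such as pymysql or pyodbc, load each table into a DataFrame, then loop over its columns and call count() on each one. Note that count() skips NULL (NaN) values but still counts empty strings.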
import pymysql
import pandas as pd
# Connection details are placeholders; recent PyMySQL versions require keyword arguments
con = pymysql.connect(host='[host name]', user='[user name]', password='[your password]', database='[database name]')
df = pd.read_sql('select * from [table name]', con)  # load the table into a pandas DataFrame
print(df)
for col in df.columns:  # loop through each column
    count_ = df[col].count()  # number of non-NULL (non-NaN) values in this column
    print(col, count_)
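To also exclude empty or blank cells, the per-column count can be combined with a check on the stripped string value. This is a minimal sketch on a made-up DataFrame rather than a real table; the helper name count_filled_cells is hypothetical, and it treats whitespace-only strings as blank, which is an assumption about what "empty/blank" should mean.

```python
import pandas as pd


def count_filled_cells(df):
    """Count cells that are neither NULL/NaN nor blank strings."""
    total = 0
    for col in df.columns:
        values = df[col]
        not_null = values.notna()
        # Compare the stripped text so '' and '   ' both count as blank.
        not_blank = values.astype(str).str.strip() != ""
        total += int((not_null & not_blank).sum())
    return total


# Made-up data: 'name' has one filled cell, 'qty' has three.
sample = pd.DataFrame({
    "name": ["a", "", None, "  "],
    "qty": [1, None, 3, 4],
})
filled = count_filled_cells(sample)
print(filled)
```

For about 80 tables, the same function could be called once per table after loading it with pd.read_sql, and the results summed.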