SQL - Count All Cells In The Entire Table That Are Not NULL And Not Empty - sql

I have recently been asked to do a count of all the cells in some tables that are not NULL and not empty/blank.
The issue is, I have about 80 tables and some of those tables have dozens of columns and others have hundreds of columns.
Is there a query I could use to count all cells from all columns that fit a specific criteria (in this case not NULL and not empty/blank)?
I have done some searching and it seems most answers revolve around single columns or tables that only have like 3-5 columns.
Thanks!

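Try connecting to the database from pandas using a connector such as pymysql or pyodbc, then loop over each column and apply the count function to it. Note that count() only skips NULL/NaN values, so empty strings are converted to NaN first.
import pymysql
import pandas as pd
import numpy as np
# recent PyMySQL versions expect keyword arguments here
con = pymysql.connect(host='[host name]', user='[user name]', password='[your password]', database='[database name]')
df = pd.read_sql('select * from [table name]', con)  # SQL table loaded into a pandas dataframe
# treat empty or whitespace-only strings as missing so they are not counted
df = df.replace(r'^\s*$', np.nan, regex=True)
for col in df.columns:  # loop through the columns
    count_ = df[col].count()  # count of non-NaN values in this column
    print(col, count_)
print(df.count().sum())  # total number of non-NULL, non-empty cells in the table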
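To extend this to many tables, the same count can be wrapped in a function and run once per table. The following is only a minimal sketch: it uses an in-memory SQLite database with two made-up tables (people and pets) so that it runs without a server, and the helper name count_filled_cells is hypothetical. With MySQL or SQL Server, the list of table names would come from that database's own catalog instead of sqlite_master.

```python
import sqlite3
import pandas as pd

def count_filled_cells(df):
    """Count cells that are neither NULL/NaN nor empty/whitespace-only strings."""
    total = 0
    for col in df.columns:
        values = df[col].dropna()  # drop NULL/NaN cells
        blank = values.map(lambda v: isinstance(v, str) and v.strip() == "")
        total += len(values) - int(blank.sum())  # drop blank strings
    return total

# Hypothetical demo data in an in-memory SQLite database
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE people (name TEXT, city TEXT, age INTEGER)")
con.executemany(
    "INSERT INTO people VALUES (?, ?, ?)",
    [("Ann", "", 30), (None, "Oslo", None), ("Bob", "  ", 41)],
)
con.execute("CREATE TABLE pets (name TEXT)")
con.executemany("INSERT INTO pets VALUES (?)", [("Rex",), ("",)])

# List the tables, then count the filled cells in each one
tables = [row[0] for row in con.execute(
    "SELECT name FROM sqlite_master WHERE type='table'")]
totals = {t: count_filled_cells(pd.read_sql(f'SELECT * FROM "{t}"', con))
          for t in tables}
grand_total = sum(totals.values())
print(totals, grand_total)
```

Loading whole tables into memory is the simplest approach, but for very large tables it may be better to read them in chunks or to push the counting into SQL.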

Related

Writing a scalable INSERT statement using cx_Oracle

I am attempting to write a script that will insert values from an uploaded dataframe into a table in an Oracle DB, but I have two problems:
too many columns to hard-code
columns aren't one-to-one
What I'm hoping for is a way to write out the columns, check whether they match the columns of my dataframe, and from there use an INSERT VALUES SQL statement to insert the values from the dataframe into the ODS table.
So far, these are the important parts of my script:
import pandas as pd
import cx_Oracle
import config
df = pd.read_excel("Employee_data.xlsx")
conn = None
try:
    conn = cx_Oracle.connect(config.username, config.password, config.dsn, encoding=config.encoding)
except cx_Oracle.Error as error:
    print(error)
finally:
    cursor = conn.cursor()  # cursor is a method, so it has to be called
sql = "SELECT * FROM ODSMGR.EMPLOYEE_TABLE"
cursor.execute(sql)
data = cursor.fetchall()
col_names = []
for i in range(0, len(cursor.description)):
    col_names.append(cursor.description[i][0])  # first field of each entry is the column name
#instead of using df.columns I use:
rows = [tuple(x) for x in df.values]
which prints my ODS column names, and allows me to conveniently store my rows from the df in an array but I'm at a loss for how to import these to the ODS. I found something like:
cursor.execute("insert into ODSMGR.EMPLOYEE_TABLE(col1,col2) values (:col1, :col2)", {":col1df":df, "col2df:df"})
but that'll mean I'll have to hard-code everything which wouldn't be scalable. I'm hoping I can get some sort of insight to help. It's just difficult since the columns aren't 1-to-1 and that there is some compression/collapsing of columns from the DF to the ODS but any help is appreciated.
NOTE: I've also attempted to use SQLalchemy but I am always given an error "ORA-12505: TNS:listener does not currently know of SID given in connect descriptor" which is really strange given that I am able to connect with cx_Oracle
EDIT 1:
To find the columns that share the same name, I ran:
import numpy as np
a = np.intersect1d(df.columns, col_names)
print("common columns:", a)
I was able to get a list of columns that the two datasets share.
I also tried to use this as my engine:
from sqlalchemy import create_engine, types
engine = create_engine("oracle+cx_oracle://username:password@ODS-test.domain.com:1521/?ODS-Test")
dtyp = {c:types.VARCHAR(df[c].str.len().max())
        for c in df.columns[df.dtypes=='object'].tolist()}
df.to_sql('ODS.EMPLOYEE_TABLE', con = engine, dtype=dtyp, if_exists='append')
which has given me nothing but errors.
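One way to avoid hard-coding the columns might be to build the INSERT statement from the list of shared columns and pass all rows to executemany. This is only a sketch under assumptions: the helper build_insert and the sample data are hypothetical, the database calls are left as comments because they need a live connection, and it does not address the columns that are collapsed between the DF and the ODS, which would need their own mapping step first.

```python
import pandas as pd

def build_insert(table, columns):
    """Build an INSERT statement with one positional bind (:1, :2, ...) per column."""
    col_list = ", ".join(columns)
    binds = ", ".join(f":{i}" for i in range(1, len(columns) + 1))
    return f"INSERT INTO {table} ({col_list}) VALUES ({binds})"

# Stand-ins for the uploaded dataframe and the ODS column names (hypothetical data)
df = pd.DataFrame({"EMP_ID": [1, 2], "NAME": ["Ann", "Bob"], "EXTRA": ["x", "y"]})
col_names = ["EMP_ID", "NAME", "HIRE_DATE"]

# Keep only the columns present on both sides, in the table's order
common = [c for c in col_names if c in df.columns]
sql = build_insert("ODSMGR.EMPLOYEE_TABLE", common)

# One tuple per row; NaN becomes None so that it is inserted as NULL
rows = [tuple(None if pd.isna(v) else v for v in row)
        for row in df[common].itertuples(index=False, name=None)]
print(sql)

# With a live cx_Oracle connection (not run here):
# cursor.executemany(sql, rows)
# conn.commit()
```

Only the values go through bind variables; the table and column names are interpolated into the SQL text, so they should come from a trusted source such as cursor.description.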

How to deal with discrepancy in csv and pandas Df?

When I load the dataset in pandas it shows 1,00,000 rows, but when I open it in Excel it shows 3,00,000 rows. Is there any Python code that could help me deal with this kind of discrepancy?
import pandas as pd
df=pd.read_csv('C_data_2.csv')
# Get the counts of each value in the gender column
counts = df['Gender'].value_counts()
# Find the most common value in the gender column
most_common = counts.index[0]
# Impute missing values in the gender column with the most common value
df['Gender'] = df['Gender'].fillna(most_common)
# Replace all instances of "nan" with most_common in the "gender" column
df["Gender"] = df["Gender"].replace("nan", most_common)
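A first diagnostic step could be to compare the number of physical lines in the file with the number of rows pandas parses. The sketch below uses a small made-up CSV held in memory; it illustrates two possible causes only (a quoted field containing a line break, and a blank line that read_csv skips by default) and is not a claim about what is wrong with C_data_2.csv.

```python
import io
import pandas as pd

# Hypothetical CSV: one quoted field spans two lines, and there is one blank line
raw = 'id,comment\n1,"line one\nline two"\n\n2,plain\n'

# Physical lines, as a naive line counter would see them (header included)
physical_lines = raw.count("\n")

# Rows as parsed by pandas: the quoted line break stays inside one cell,
# and the blank line is dropped (skip_blank_lines=True is the default)
df = pd.read_csv(io.StringIO(raw))
print(physical_lines, len(df))
```

For a real file, the same comparison can be made by counting lines with open(...) and comparing the result against len(df).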

pandas groupby returns multiindex with two more aggregates

When grouping by a single column, and using as_index=False, the behavior is expected in pandas. However, when I use .agg, as_index no longer appears to behave as expected. In short, it doesn't appear to matter.
# imports
import pandas as pd
import numpy as np
# set the seed
np.random.seed(834)
df = pd.DataFrame(np.random.rand(10, 1), columns=['a'])
df['letter'] = np.random.choice(['a','b'], size=10)
summary = df.groupby('letter', as_index=False).agg([np.count_nonzero, np.mean])
summary
returns:
                   a
       count_nonzero      mean
letter
a                6.0  0.539313
b                4.0  0.456702
I would have expected the index to be 0, 1, with letter as a column in the dataframe.
In summary, I want to be able to group by one or more columns, summarize a single column with multiple aggregates, and return a dataframe that has neither the group-by columns as the index nor a MultiIndex in the columns.
The comment from @Trenton did the trick.
summary = df.groupby('letter')['a'].agg([np.count_nonzero, np.mean]).reset_index()
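Another option that might fit here is named aggregation (available since pandas 0.25), which gives flat column names and respects as_index=False. The sketch below uses a small hand-made dataframe rather than the random data above, so the values differ from the output shown in the question.

```python
import numpy as np
import pandas as pd

# Small deterministic example (hypothetical data)
df = pd.DataFrame({"letter": ["a", "b", "a", "b"],
                   "a": [1.0, 0.0, 3.0, 4.0]})

# Each keyword becomes an output column: name=(source column, aggregation)
summary = df.groupby("letter", as_index=False).agg(
    count_nonzero=("a", np.count_nonzero),
    mean=("a", "mean"),
)
print(summary)
```

The result has a default integer index and three flat columns: letter, count_nonzero and mean.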

How to filter in rows where any column is null in pyspark dataframe

It has to be somewhere on stackoverflow already but I'm only finding ways to filter the rows of a pyspark dataframe where 1 specific column is null, not where any column is null.
import pandas as pd
import pyspark.sql.functions as F
my_dict = {"column1":list(range(100)),"column2":["a","b","c",None]*25,"column3":["a","b","c","d",None]*20}
my_pandas_df = pd.DataFrame(my_dict)
sparkDf = spark.createDataFrame(my_pandas_df)
sparkDf.show(5)
I'm trying to include any row with null values on any column of my dataframe, basically the opposite of this:
sparkDf.na.drop()
For including rows having any columns with null:
sparkDf.filter(F.greatest(*[F.col(i).isNull() for i in sparkDf.columns])).show(5)
For excluding the same:
sparkDf.na.drop(how='any').show(5)
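To sanity-check the Spark result, the same split can be computed on the pandas dataframe from the question before it is converted. This is a pandas-only sketch and does not replace the PySpark filter; it only gives row counts to compare against.

```python
import pandas as pd

my_dict = {
    "column1": list(range(100)),
    "column2": ["a", "b", "c", None] * 25,
    "column3": ["a", "b", "c", "d", None] * 20,
}
my_pandas_df = pd.DataFrame(my_dict)

# Rows where at least one column is null
any_null = my_pandas_df[my_pandas_df.isna().any(axis=1)]
# Rows with no nulls at all (the pandas counterpart of na.drop(how='any'))
no_null = my_pandas_df.dropna(how="any")
print(len(any_null), len(no_null))
```

For this data, 40 rows contain at least one null and 60 contain none, so the two Spark expressions above would be expected to return those counts as well.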

How do I swap two (or more) columns in two different data tables? on pandas

I'm new here and new to programming.
As the title says, I am trying to swap two full columns between two different files (the columns have the same name but different data). I started with this:
import numpy as np
import pandas as pd
from pandas import DataFrame
df = pd.read_csv('table1.csv')  # read_csv has no col_name argument; read the full table
df1 = pd.read_csv('table2.csv')
df1.COL1 = df.COL1
But now I am stuck. How do I select a whole column, and how can I write the new combined table to a new file (i.e. table 3)?
You could perform the swap by copying one column into a temporary column and deleting it afterwards, as follows:
df1['temp'] = df1['COL1']
df1['COL1'] = df['COL1']
df['COL1'] = df1['temp']
del df1['temp']
and then write the result to a third CSV via to_csv:
df1.to_csv('table3.csv', index=False)  # index=False avoids writing the row index as an extra column
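The swap can also be written without a temporary column by using tuple assignment. The sketch below uses two small in-memory dataframes in place of table1.csv and table2.csv (the data is made up) and assumes both tables have the same number of rows and the same index.

```python
import pandas as pd

# Stand-ins for the two CSV files (hypothetical data)
df = pd.DataFrame({"COL1": [1, 2, 3], "OTHER": ["x", "y", "z"]})
df1 = pd.DataFrame({"COL1": [10, 20, 30], "OTHER": ["p", "q", "r"]})

# The right-hand side is evaluated first, so both originals are captured
# before either column is overwritten
df["COL1"], df1["COL1"] = df1["COL1"].copy(), df["COL1"].copy()

# to_csv() without a path returns the CSV text instead of writing a file
csv_text = df1.to_csv(index=False)
print(csv_text)
```

Passing a filename such as 'table3.csv' to to_csv would write the same content to disk.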