I have the following column in a table.
daily;1;21/03/2015;times;10
daily;1;01/02/2016;times;8
monthly;1;01/01/2016;times;2
weekly;1;21/01/2016;times;4
How can I parse this by the ; delimiter into different columns?
One way to do it would be to pull it into pandas, split on the semicolon, and write it back to SQL Server. See below for an example which I tested.
CODE
import sqlalchemy as sa
import urllib.parse
import pandas as pd

server = 'yourserver'
read_database = 'db_to_read_data_from'
write_database = 'db_to_write_data_to'
read_tablename = 'table_to_read_from'
write_tablename = 'table_to_write_to'

# Build ODBC connection strings for the source and destination databases
read_params = urllib.parse.quote_plus("DRIVER={SQL Server};SERVER="+server+";DATABASE="+read_database+";TRUSTED_CONNECTION=Yes")
read_engine = sa.create_engine("mssql+pyodbc:///?odbc_connect=%s" % read_params)
write_params = urllib.parse.quote_plus("DRIVER={SQL Server};SERVER="+server+";DATABASE="+write_database+";TRUSTED_CONNECTION=Yes")
write_engine = sa.create_engine("mssql+pyodbc:///?odbc_connect=%s" % write_params)
#Read from SQL into DF
Table_DF = pd.read_sql(read_tablename, con=read_engine)
#Delimit by semicolon
parsed_DF = Table_DF['string_column'].apply(lambda x: pd.Series(x.split(';')))
#write DF back to SQL
parsed_DF.to_sql(write_tablename,write_engine,if_exists='append')
RESULT
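Splitting the sample rows on the semicolon should give a DataFrame along these lines (the integer column names 0-4 are the defaults produced by pd.Series):

         0  1           2      3   4
0    daily  1  21/03/2015  times  10
1    daily  1  01/02/2016  times   8
2  monthly  1  01/01/2016  times   2
3   weekly  1  21/01/2016  times   4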
I am attempting to write a script that will allow me to insert values from an uploaded dataframe into a table inside an Oracle DB, but my issues are:
there are too many columns to hard-code
the columns aren't one-to-one
What I'm hoping for is a way to write out the columns, check whether they match the columns of my dataframe, and from there use an INSERT ... VALUES SQL statement to load the values from the dataframe into the ODS table.
So far, these are the important parts of my script:
import pandas as pd
import cx_Oracle
import config
df = pd.read_excel("Employee_data.xlsx")
conn = None
try:
    conn = cx_Oracle.connect(config.username, config.password, config.dsn, encoding=config.encoding)
except cx_Oracle.Error as error:
    print(error)
finally:
    cursor = conn.cursor()
    sql = "SELECT * FROM ODSMGR.EMPLOYEE_TABLE"
    cursor.execute(sql)
    data = cursor.fetchall()
    # collect the ODS column names from the cursor description
    col_names = []
    for i in range(0, len(cursor.description)):
        col_names.append(cursor.description[i][0])

# instead of using df.columns I use:
rows = [tuple(x) for x in df.values]
which prints my ODS column names and lets me conveniently store the rows from the df in a list, but I'm at a loss for how to insert these into the ODS table. I found something like:
cursor.execute("insert into ODSMGR.EMPLOYEE_TABLE(col1,col2) values (:col1, :col2)", {":col1df":df, "col2df:df"})
but that would mean hard-coding everything, which wouldn't be scalable. I'm hoping I can get some insight to help. It's just difficult since the columns aren't 1-to-1 and there is some compression/collapsing of columns from the DF to the ODS, but any help is appreciated.
NOTE: I've also attempted to use SQLAlchemy, but I always get the error "ORA-12505: TNS:listener does not currently know of SID given in connect descriptor", which is really strange given that I am able to connect with cx_Oracle.
EDIT 1:
I was able to get a list of columns that share the same name; so after running:
import numpy as np
a = np.intersect1d(df.columns, col_names)
print("common columns:", a)
I got the list of columns that the two datasets share.
I also tried to use this as my engine:
engine = create_engine("oracle+cx_oracle://username:password#ODS-test.domain.com:1521/?ODS-Test")
dtyp = {c:types.VARCHAR(df[c].str.len().max())
for c in df.columns[df.dtypes=='object'].tolist()}
df.to_sql('ODS.EMPLOYEE_TABLE', con = engine, dtype=dtyp, if_exists='append')
which has given me nothing but errors.
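For what it's worth, one way to build the INSERT dynamically from the common columns with cx_Oracle's executemany could look roughly like this. It's only a sketch: it assumes the column names in a match the ODS column names exactly and need no remapping or collapsing, and it reuses the conn object from the snippet above.

# Sketch: build the INSERT statement from the common columns found via
# np.intersect1d, then bulk-insert the matching DataFrame values.
common_cols = list(a)
placeholders = ", ".join(":%d" % (i + 1) for i in range(len(common_cols)))
sql = "INSERT INTO ODSMGR.EMPLOYEE_TABLE (%s) VALUES (%s)" % (", ".join(common_cols), placeholders)

rows = list(df[common_cols].itertuples(index=False, name=None))
cursor = conn.cursor()
cursor.executemany(sql, rows)
conn.commit()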
I want to loop through a column and convert it into a processed Series.
Below is an example of a two-row, four-column data frame:
import pandas as pd
from rapidfuzz import process as process_rapid
from rapidfuzz import utils as rapid_utils
data = [['r/o ac. nephritis. /. nephrotic syndrome', ' ac. nephritis. /. nephrotic syndrome',1,'ac nephritis nephrotic syndrome'], [ 'sternocleidomastoid contracture','sternocleidomastoid contracture',0,"NA"]]
# Create the pandas DataFrame
df_diagnosis = pd.DataFrame(data, columns = ['diagnosis_name', 'diagnosis_name_edited','is_spell_corrected','spell_corrected_value'])
I want to use the spell_corrected_value column if the is_spell_corrected column is more than 1; otherwise, use diagnosis_name_edited.
At the moment, I have the following code that directly uses the diagnosis_name_edited column. How do I turn this into an if-else/lambda check on the is_spell_corrected column?
unmapped_diag_series = (rapid_utils.default_process(d) for d in df_diagnosis['diagnosis_name_edited'].astype(str))  # generator of processed strings
unmapped_processed_diagnosis = pd.Series(unmapped_diag_series)
Thank you.
If I understand you correctly, try this fast solution using numpy.where:
df_diagnosis['new_column'] = np.where(df_diagnosis['is_spell_corrected'] > 1, df_diagnosis['spell_corrected_value'], df_diagnosis['diagnosis_name_edited'])
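numpy needs to be imported for this, and the conditionally chosen column can then be fed into the original generator from the question. A sketch based on the question's code (new_column is just a working name):

import numpy as np
import pandas as pd
from rapidfuzz import utils as rapid_utils

# pick spell_corrected_value when is_spell_corrected > 1, else diagnosis_name_edited
df_diagnosis['new_column'] = np.where(df_diagnosis['is_spell_corrected'] > 1,
                                      df_diagnosis['spell_corrected_value'],
                                      df_diagnosis['diagnosis_name_edited'])

# process the conditionally chosen column instead of diagnosis_name_edited
unmapped_diag_series = (rapid_utils.default_process(d) for d in df_diagnosis['new_column'].astype(str))
unmapped_processed_diagnosis = pd.Series(unmapped_diag_series)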
I'm trying to ingest a df I created from a JSON response into an existing table (the table is currently empty because I can't seem to get this to work).
The df looks something like the table below:
index    clicks_affiliated
0        3214
1        2221
but I'm seeing the following error:
snowflake.connector.errors.ProgrammingError: 000904 (42000): SQL
compilation error: error line 1 at position 94
invalid identifier '"clicks_affiliated"'
and the column names in Snowflake match the columns in my dataframe.
This is my code:
import pandas as pd
from snowflake.sqlalchemy import URL
from sqlalchemy import create_engine
import snowflake.connector
from snowflake.connector.pandas_tools import write_pandas, pd_writer
from pandas import json_normalize
import requests
df_norm = json_normalize(json_response, 'reports')
#I've tried also adding the below line (and removing it) but I see the same error
df = df_norm.reset_index(drop=True)
def create_db_engine(db_name, schema_name):
    engine = URL(
        account="ab12345.us-west-2",
        user="my_user",
        password="my_pw",
        database="DB",
        schema="PUBLIC",
        warehouse="WH1",
        role="DEV"
    )
    return engine
def create_table(out_df, table_name, idx=False):
    url = create_db_engine(db_name="DB", schema_name="PUBLIC")
    engine = create_engine(url)
    connection = engine.connect()

    try:
        out_df.to_sql(
            table_name, connection, if_exists="append", index=idx, method=pd_writer
        )
    except ConnectionError:
        print("Unable to connect to database!")
    finally:
        connection.close()
        engine.dispose()
    return True
print(df.head())
create_table(df, "reporting")
So... it turns out I needed to change the columns in my dataframe to uppercase.
I've added this after the dataframe creation to do so, and it worked:
df.columns = map(lambda x: str(x).upper(), df.columns)
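For context: Snowflake resolves unquoted identifiers to uppercase, and as far as I can tell pd_writer quotes the DataFrame's column names, so a lowercase clicks_affiliated column gets looked up case-sensitively and isn't found. An equivalent way to uppercase the columns (assuming all column labels are strings) would be:

# uppercase every column label so it matches Snowflake's unquoted (uppercase) identifiers
df.columns = df.columns.str.upper()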
I need help converting a row of comma-separated values into separate rows and then saving them to a file.
Example:
R1,R2,R3
to
R1
R2
R3
This is what I have, but all the files I'm creating have the values in one row, separated by commas. I'm just trying to add code so that the files store the values in one column, as shown above.
import pandas as pd
filename = 'my_filename.xlsx'
df = pd.read_excel(filename,sheet_name='Sheet1')
num_of_rows = len(df)
ref_des = df['Reference Designator']
i = 0
while i < num_of_rows:
    mpn = df.loc[i]['MFG_PART_NUMBER']
    if "/" in mpn:
        # replace '/' so the part number is a valid filename
        mpn = mpn.replace('/', '_')
    new_filename = mpn + '.lst'
    # write the reference designators for this row to the file
    with open(new_filename, 'wt') as f:
        f.write(ref_des[i])
    i = i + 1
Given a CSV file like (no headers)
R1,R2,R3
Try:
import pandas as pd

df = pd.read_csv('test_csv.csv', header=None)
df = df.T
print(df)
Outputs:
0
0 R1
1 R2
2 R3
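Since the question also asks to save the result to a file, the transposed frame can then be written out without an index or header (the filename here is just an example):

# one value per line, no index or header
df.to_csv('output.lst', index=False, header=False)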
My PySpark script crashes when I use collect() or show(). My dataframe has only 570 rows, so I don't understand what is happening.
I have a DataFrame and I have created a function that extracts a list of distinct rows from it. It was working fine, then suddenly I got an error:
py4j.protocol.Py4JJavaError: An error occurred while calling
o323.collectToPython
I get a similar error when I try to show() the dataframe.
Is there an alternative method to extract a list with distinct values from a dataframe?
required_list = [(col1,col2), (col1,col2)]
Sorry for not posting the code, but it's a large script and it's confidential.
Update:
I have a function that extracts distinct values from a df:
def extract_dist(df, select_cols):
    val = len(select_cols)
    list_val = [row[0:val] for row in df.select(*select_cols).distinct().na.drop().collect()]
    return list_val
The function worked fine until I got the error.
I have a main script where I import this function and also another function that calculates a dataframe:
def calculate_df(df_join, v_srs, v_db, v_tbl, sql_context):
    cmd = "impala-shel....'create table db.tbl as select * from v_db.v_tbl'"
    os.system(cmd)

    select_stm = "select * from db.tbl"
    df = sql_context.sql(select_stm)

    cmd = "impala-shel....'drop table if exists db.tbl'"
    os.system(cmd)

    join_cond = [...]
    joined_df = df.join(df_join, join_cond, 'left').select(..)
    df1 = joined_df.filter(...)
    df2 = joined_df.filter(...)
    final_df = df1.union(df2)
    final_df.show()  # error from show
    return final_df
Main script:
import extract_dist
import calculate_df
df_join = ...extract from a hive table
for conn in details:
    v_db = conn['database'].upper()
    v_tbl = conn['table'].upper()
    v_name = conn['descr'].upper()
    if v_name in lst:
        df = calculate_df(df_join, v_name, v_db, v_tbl, sqlContext)
        df = df.filter(...column isin list)
        df = df.filter(..).filter(..)
        # extract list with distinct rows from df using dist function
        df.show()  # error from show
        dist_list = extract_dist(df, [col1, col2])  # error from collect
        for x, y in dist_list:
            ....
If I don't use show(), the error appears when I run the collect() method.
The same script worked before and suddenly failed. Is there a memory issue? Do I have to clear memory?
SOLVED:
I found the issue: after I created the dataframe from the table, I dropped the table.
cmd = "impala-shel....'drop table if exists db.tbl'"
os.system(cmd)
After I removed the drop table command, the script ran successfully.
I will drop the temporary table at the end of the script, after I am finished with the extracted dataframe. I didn't know that creating a dataframe and then dropping its source table would cause an error later; Spark evaluates dataframes lazily, so the source table still has to exist when show() or collect() actually runs.
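If the table really has to be dropped early, one option (a sketch, not tested against the script above) is to materialize the dataframe before the drop, for example by persisting it and forcing an action:

# materialize the dataframe so later actions no longer need the source table
# (sketch; df, select_stm and cmd follow the names used in calculate_df above)
df = sql_context.sql(select_stm)
df.persist()
df.count()      # forces evaluation while db.tbl still exists
os.system(cmd)  # now the temporary table can be dropped

Note that if the cached partitions are ever evicted, Spark would need the source table again to recompute them, so dropping the table at the end of the script remains the safer approach.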