I would like to write forecast data into a SQL Server database using R and RODBC. Each forecast covers the next six hours, and I would like to save only the newest generation of each forecast. Illustrated here:
set.seed(1)
# First forecast at 00:00:00
df.0 <- data.frame(Dates = seq.POSIXt(from = as.POSIXct("2015-10-29 00:00:00"),
                                      to = as.POSIXct("2015-10-29 05:00:00"), by = "hour"),
                   Value = runif(6, min = 0, max = 6))
# Second forecast at 01:00:00
df.1 <- data.frame(Dates = seq.POSIXt(from = as.POSIXct("2015-10-29 01:00:00"),
                                      to = as.POSIXct("2015-10-29 06:00:00"), by = "hour"),
                   Value = runif(6, min = 0, max = 6))
Now, at 00:00:00 I would save my first forecast into my database dbdata:
require(RODBC)
sqlSave(channel = dbdata, dat = df.0, tablename = "forecasts",
        append = TRUE, rownames = FALSE, fast = FALSE, verbose = TRUE)
# Query: INSERT INTO "forecasts" ("Dates", "Value") VALUES
#   ('2015-10-29 00:00:00', '1.59')
# Query: INSERT INTO "forecasts" ("Dates", "Value") VALUES
#   ('2015-10-29 01:00:00', '2.23')
# etc. for all 6 rows of the forecast
Now, at 01:00:00 I get a new forecast. I want to save/update this forecast, so I replace all the values from 01:00:00 to 05:00:00 and then add the newest 06:00:00 forecast as well.
The update works well, so I can overwrite the existing rows, but sqlUpdate can't insert the new 06:00:00 row.
sqlUpdate(channel = dbdata, dat = df.1, tablename = "forecasts",
          fast = FALSE, index = c("Dates"), verbose = TRUE)
# Query: UPDATE "forecasts" SET "Value" = 5.668 WHERE "Dates" = '2015-10-29 01:00:00'
# etc. until
# Error in sqlUpdate(channel = dbdata, dat = df.1,
#   tablename = "forecasts", :
#   [RODBC] ERROR: Could not SQLExecDirect
#   'UPDATE "forecasts" SET "Value" = 1.059 WHERE "Dates" = '2015-10-29 06:00:00''
So, this can probably be solved in a lot of ways, but what are the good ways to do it?
I think there must be better ways than reading the table back to find out how far the forecast already extends in the database, then splitting the new data into an update part and an insert part and writing each of them in.
It is T-SQL on a Microsoft SQL Server. The tables are in the same database, but that is pure coincidence, which means that RODBC: merge tables from different databases (channel) shouldn't be an issue and perhaps I can get away with a T-SQL MERGE INTO. But next time I probably won't be able to.
You can try a conditional insert followed by an update: the conditional insert only adds a row when its Dates value does not exist yet, and the update always succeeds (you do a few unnecessary updates when a row was just inserted).
Something like the following for the conditional insert:
INSERT INTO "forecast" ( "Dates", "Values") VALUES ( '2015-10-29 00:00:00', '2.23') where not exists (select 1 from "forecast" where "Dates"='2015-10-29 00:00:00')
I'm writing some code that will pull data from an API and insert the records into a table for me.
I'm unsure how to go about formatting my insert statement. I want to insert values where there is no existing match in the table (based on date), and I don't want to insert values where the column opponents = my school's team.
import datetime
import requests
import cx_Oracle
import os
from pytz import timezone
currentYear = 2020
con = Some_datawarehouse
cursor = con.cursor()
json_obj = requests.get('https://api.collegefootballdata.com/games?year='+str(currentYear)+'&seasonType=regular&team=myteam')\
.json()
for item in json_obj:
    EVENTDATE = datetime.datetime.strptime(item['start_date'], '%Y-%m-%dT%H:%M:%S.%fZ').date()
    EVENTTIME = str(datetime.datetime.strptime(item['start_date'], '%Y-%m-%dT%H:%M:%S.%fZ').replace(tzinfo=timezone('EST')).time())
    FINAL_SCORE = item.get("home_points", None)
    OPPONENT = item.get("away_team", None)
    OPPONENT_FINAL_SCORE = item.get("away_points", None)
    cursor.execute('''INSERT INTO mytable(EVENTDATE,EVENTTIME,FINAL_SCORE,OPPONENT,OPPONENT_FINAL_SCORE) VALUES (:1,:2,:3,:4,:5)
                      WHERE OPPONENT <> 'my team'
                      AND EVENTDATE NOT EXISTS (SELECT EVENTDATE FROM mytable);''',
                   [EVENTDATE,EVENTTIME,FINAL_SCORE,OPPONENT,OPPONENT_FINAL_SCORE])
con.commit()
con.close()
This may be more of an Oracle SQL question than a Python one, but I'm not sure whether cursor.execute can accept MERGE statements. I also recognize that the WHERE clause will not work here as written, but it shows the idea of what I'm trying to accomplish.
Change the SQL query to this (selecting the bind values from dual so the row can be filtered before it is inserted):
INSERT INTO mytable (EVENTDATE, EVENTTIME, FINAL_SCORE, OPPONENT, OPPONENT_FINAL_SCORE)
SELECT * FROM (
    SELECT :1 EVENTDATE, :2 EVENTTIME, :3 FINAL_SCORE, :4 OPPONENT, :5 OPPONENT_FINAL_SCORE
    FROM dual
) vt
WHERE vt.OPPONENT <> 'my team'
AND vt.EVENTDATE NOT IN (SELECT EVENTDATE FROM mytable)
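As for the MERGE part of the question: cursor.execute can run a MERGE statement like any other SQL. A minimal sketch, reusing the bind values from the loop in the question; the USING ... FROM dual subquery is an assumption about how to shape the source row, and the opponent filter is done in Python here:
merge_sql = """
    MERGE INTO mytable t
    USING (SELECT :1 AS EVENTDATE, :2 AS EVENTTIME, :3 AS FINAL_SCORE,
                  :4 AS OPPONENT, :5 AS OPPONENT_FINAL_SCORE
           FROM dual) s
    ON (t.EVENTDATE = s.EVENTDATE)
    WHEN NOT MATCHED THEN
        INSERT (EVENTDATE, EVENTTIME, FINAL_SCORE, OPPONENT, OPPONENT_FINAL_SCORE)
        VALUES (s.EVENTDATE, s.EVENTTIME, s.FINAL_SCORE, s.OPPONENT, s.OPPONENT_FINAL_SCORE)
"""

# Skip my own team's rows in Python instead of in the SQL
if OPPONENT != 'my team':
    cursor.execute(merge_sql,
                   [EVENTDATE, EVENTTIME, FINAL_SCORE, OPPONENT, OPPONENT_FINAL_SCORE])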
I am executing a SQL query from a Python script to retrieve data from Snowflake on Windows 10, but the resulting data frame is missing the column names; they are replaced by 0, 1, 2, 3 and so on. Executing the same query in the Snowflake interface and downloading the CSV does give the column names in the file. I am passing the column names as aliases in my query.
Below is the code:
import pandas as pd

def _CONSUMPTION(con):
    data2 = con.cursor().execute("""select sd.sales_force_lvl_1_code "Plan-To Code",sd.sales_force_lvl_1_desc "Plan-To Description",pd.matl_code "Product Code",pd.matl_desc "Product Description",pd.ean_upc_code "UPC",dd.fiscal_week_desc "Fiscal Week Description",f.unit_sales_qty "Sales Units",f.incr_units_qty "Incremental Units"
from DW.consumption_fact1 f, DW.market_dim md, DW.matl_dim pd, DW.fiscal_week_dim dd, (select sales_force_lvl_1_code,max(sales_force_lvl_1_desc) sales_force_lvl_1_desc from DW.mv_us_sales_force_dim group by sales_force_lvl_1_code) sd
where dd.fiscal_week_key = f.fiscal_week_key
and pd.matl_key = f.matl_key
and md.market_key = f.market_key
and sd.sales_force_lvl_1_code = md.curr_sales_force_lvl_1_code
and dd.fiscal_week_key between (select curr_fy_week_key-6 from DW.curr_date_lkp) and (select curr_fy_week_key-1 from DW.curr_date_lkp)
and f.company_key = 6006
and (f.unit_sales_qty <> 0 and f.sales_amt <> 0)
and md.curr_sales_force_lvl_1_code is not null
UNION
select '5000016240' "Plan-To Code", 'AWG TOTAL' "Plan-To Description",pd.matl_code "Product Code",pd.matl_desc "Product Description",pd.ean_upc_code "UPC",dd.fiscal_week_desc "Fiscal Week Description",f.unit_sales_qty "Sales Units",f.incr_units_qty "Incremental Units"
from DW.consumption_fact1 f, DW.market_dim md, DW.matl_dim pd, DW.fiscal_week_dim dd
where dd.fiscal_week_key = f.fiscal_week_key
and pd.matl_key = f.matl_key
and md.market_key = f.market_key
and dd.fiscal_week_key between (select curr_fy_week_key-6 from DW.curr_date_lkp) and (select curr_fy_week_key-1 from DW.curr_date_lkp)
and f.company_key = 6006
and (f.unit_sales_qty <> 0 and f.sales_amt <> 0)
and md.market_code = '20267'""").fetchall()
    df = pd.DataFrame(data2)
    df.head(5)
    df.to_csv('CONSUMPTION.csv', index=False)
Looking at the docs, it seems the easiest way is to use the cursor method .fetch_pandas_all():
query = "SELECT 1 a, 2 b, 'a' c UNION ALL SELECT 7,4,'snow'"
cur = connection.cursor()
cur.execute(query).fetch_pandas_all()
Or if you want to dump the results into a CSV, just do so as in the question:
query = "SELECT 1 a, 2 b, 'a' c UNION ALL SELECT 7,4,'snow'"
cur = connection.cursor()
df = cur.execute(query).fetch_pandas_all()
df.to_csv('x.csv', index = False)
It looks like you haven't set the column names on the data frame you build from the fetched rows.
My recommendation would be to set the columns first, via df.columns.
In addition, refer to the Snowflake documentation page for details:
https://docs.snowflake.com/en/user-guide/python-connector-pandas.html
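A minimal sketch of that idea, using the names from the question (the rows have to be fetched from the same cursor object so that cur.description is still available):
import pandas as pd

cur = con.cursor()
cur.execute("""select ... """)   # the long query from the question
data2 = cur.fetchall()

# Each entry of cur.description is a tuple whose first element is the column name/alias
df = pd.DataFrame(data2, columns=[col[0] for col in cur.description])
df.to_csv('CONSUMPTION.csv', index=False)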
Try this
import pandas as pd

def fetch_pandas_old(cur, sql):
    cur.execute(sql)
    rows = 0
    while True:
        dat = cur.fetchmany(50000)
        if not dat:
            break
        # Take just the first element of each cur.description tuple as the column name
        df = pd.DataFrame(dat, columns=[col[0] for col in cur.description])
        rows += df.shape[0]
    print(rows)
A nice way to extract the column headings from the cursor description and save in a pandas df using the Snowflake connector (also works for psycopg2 btw) is as follows:
import pandas as pd
import snowflake.connector

# Create the connection
def connect_snowflake(uname, pword, acct, role_name, whouse, dbase, schema_name):
    conn = snowflake.connector.connect(
        user=uname,
        password=pword,
        account=acct,
        role=role_name,
        warehouse=whouse,
        database=dbase,
        schema=schema_name
    )
    cur = conn.cursor()
    return conn, cur
Then execute your query. The cur.description object returns a list of tuples, the first of each being the column name :)
conn, cur = connect_snowflake(username, password, account_name, role, warehouse, database, schema)
cur.execute('select * from my_schema.my_table')
result =cur.fetchall()
# Extract the column names
col_names = []
for elt in cur.description:
    col_names.append(elt[0])
df = pd.DataFrame(result, columns=col_names)
cur.close()
conn.close()
I need to use R to write a query against a database my R environment is connected to. The structure of the query looks like this:
ALTER TABLE cph.table_id ADD PARTITION (event_date = 'YYYY-MM-DD')
LOCATION 's3://external-dwh-company-com/id-to-idl/YYYYMMDD'
So, for example, today's addition would look like this:
ALTER TABLE cph.table_id ADD PARTITION (event_date = '2018-08-02')
LOCATION 's3://external-dwh-company-com/id-to-idl/20180802'
The issue is, I need to do this for every date going back to 2018-03-01.
So the steps would look like:
initial_query <- paste(#however the above query would be formatted with the dates)
results_query <- dbGetQuery(conn, initial_query)
But yeah, the biggest hurdles for me are 1) figuring out the paste formatting for that first part and 2) creating a loop that will allow me to run the above steps up to the current date.
Consider looping through the range of days since a targeted begin date with lapply and seq, concatenating strings with sprintf and corresponding date format:
# GET DIFFERENCE BETWEEN TODAY AND DESIRED BEGIN DATE
date_diff <- Sys.Date() - as.Date("2018-03-01")
date_diff[[1]]
# 155
sql <- "ALTER TABLE cph.table_id ADD PARTITION (event_date = '%s')
LOCATION 's3://external-dwh-company-com/id-to-idl/%s'"
# RUN QUERY SEQUENTIALLY ADDING TO BEGIN DATE
output <- lapply(seq(1, date_diff[[1]]), function(i)
  dbGetQuery(conn, sprintf(sql,
                           strftime(as.Date("2018-03-01") + i, "%Y-%m-%d"),
                           strftime(as.Date("2018-03-01") + i, "%Y%m%d"))))
I am using Python to extract data from a SQL database via ODBC. When I run the query, I need to use variables in it so that the query result can change. For example, my code is:
import pandas as pd
import pyodbc
myConnect = pyodbc.connect('DSN=B1P HANA;UID=***;PWD=***')
myCursor = myConnect.cursor()
Start = 20180501
End = 20180501
myOffice = pd.Series([1,2,3])
myRow = myCursor.execute("""
SELECT "CALDAY" AS "Date",
"/BIC/ZSALE_OFF" AS "Office"
FROM "SAPB1P"."/BIC/AZ_RT_A212"
WHERE "CALDAY" BETWEEN 20180501 AND 20180501
GROUP BY "CALDAY","/BIC/ZSALE_OFF"
""")
Result = myRow.fetchall()
d = pd.DataFrame(columns=['Date','Office'])
for i in Result:
    d = d.append({'Date': i.Date,
                  'Office': i.Office},
                 ignore_index=True)
You can see that I retrieve data from the SQL database and save it into a list (Result), then convert this list to a data frame (d).
But, my problems are:
I need to specify a start date and an end date in the myCursor.execute part, something like "CALDAY" BETWEEN Start AND End
Let's say I have 100 offices in my data. Now I just need 3 of them (myOffice). So, I need to put a condition in myCursor.execute part, like myOffice in (1,2,3)
In R, I know how to deal with these two problems. the code is like:
office_clause = ""
if (myOffice != 0) {
  office_clause = paste(
    'AND "/BIC/ZSALE_OFF" IN (', paste(myOffice, collapse=", "), ')'
  )
}
a <- sqlQuery(ch,paste(' SELECT ***
FROM ***
WHERE "CALDAY" BETWEEN',Start,'AND',End,'
',office_clause,'
GROUP BY ***
'))
But I do not know how to do this in Python. How can I do this?
You can use string formatting operations for this.
First define
query = """
SELECT
"CALDAY" AS "Date",
"/BIC/ZSALE_OFF" AS "Office"
FROM
"SAPB1P"."/BIC/AZ_RT_A212"
WHERE
"CALDAY" BETWEEN {start} AND {end}
{other_conds}
GROUP BY
"CALDAY","/BIC/ZSALE_OFF"
"""
Now you can use
myRow = myCursor.execute(query.format(
    start='20180501',
    end='20180501',
    other_conds=''))
and
myRow = myCursor.execute(query.format(
    start='20180501',
    end='20180501',
    other_conds='AND "/BIC/ZSALE_OFF" IN (1,2,3)'))
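If you would rather keep the values out of the SQL string entirely, a hedged alternative is to build only the placeholders dynamically and pass the values as parameters (pyodbc uses the qmark style); the column and variable names below are taken from the question:
offices = [1, 2, 3]
placeholders = ', '.join('?' for _ in offices)   # "?, ?, ?"

sql = """
SELECT "CALDAY" AS "Date", "/BIC/ZSALE_OFF" AS "Office"
FROM "SAPB1P"."/BIC/AZ_RT_A212"
WHERE "CALDAY" BETWEEN ? AND ?
AND "/BIC/ZSALE_OFF" IN ({})
GROUP BY "CALDAY", "/BIC/ZSALE_OFF"
""".format(placeholders)

myRow = myCursor.execute(sql, [Start, End] + offices)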
I have a SQLite database with table myTable and columns id, posX, posY. The number of rows changes constantly (might increase or decrease). If I know the value of id for each row, and the number of rows, can I perform a single SQL query to update all of the posX and posY fields with different values according to the id?
For example:
---------------------
myTable:
id posX posY
1 35 565
3 89 224
6 11 456
14 87 475
---------------------
SQL query pseudocode:
UPDATE myTable SET posX[id] = #arrayX[id], posY[id] = #arrayY[id] "
#arrayX and #arrayY are arrays which store new values for the posX and posY fields.
If, for example, arrayX and arrayY contain the following values:
arrayX = { 20, 30, 40, 50 }
arrayY = { 100, 200, 300, 400 }
... then the database after the query should look like this:
---------------------
myTable:
id posX posY
1 20 100
3 30 200
6 40 300
14 50 400
---------------------
Is this possible? I'm updating one row per query right now, but it's going to take hundreds of queries as the row count increases. I'm doing all this in AIR by the way.
There are a couple of ways to accomplish this reasonably efficiently.
First -
If possible, you can do some sort of bulk insert into a temporary table. This depends somewhat on your RDBMS/host language, but at worst it can be accomplished with simple dynamic SQL (using a VALUES() clause) and then a standard update-from-another-table. Most systems also provide utilities for bulk loads. (A short sketch of this approach appears after the second option below.)
Second -
This is somewhat RDBMS-dependent as well: you could construct a dynamic update statement. In this case, the VALUES(...) clause inside the CTE is created on the fly:
WITH Tmp(id, px, py) AS (VALUES(id1, newsPosX1, newPosY1),
(id2, newsPosX2, newPosY2),
......................... ,
(idN, newsPosXN, newPosYN))
UPDATE TableToUpdate SET posX = (SELECT px
FROM Tmp
WHERE TableToUpdate.id = Tmp.id),
posY = (SELECT py
FROM Tmp
WHERE TableToUpdate.id = Tmp.id)
WHERE id IN (SELECT id
FROM Tmp)
(According to the documentation, this should be valid SQLite syntax, but I can't get it to work in a fiddle)
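Returning to the first (temporary-table) option: here is a minimal sketch using Python's sqlite3 module, since the question targets SQLite. The table and column names come from the question; the database file, the temp-table name, and the UPDATE ... FROM syntax (available in SQLite 3.33+) are assumptions:
import sqlite3

conn = sqlite3.connect("positions.db")   # assumed database file
new_values = [(1, 20, 100), (3, 30, 200), (6, 40, 300), (14, 50, 400)]  # (id, posX, posY)

with conn:
    conn.execute("CREATE TEMP TABLE tmp_pos (id INTEGER PRIMARY KEY, posX INTEGER, posY INTEGER)")
    # Bulk-load the new values into the temporary table
    conn.executemany("INSERT INTO tmp_pos (id, posX, posY) VALUES (?, ?, ?)", new_values)
    # Standard update-from-another-table (SQLite 3.33+)
    conn.execute("""
        UPDATE myTable
        SET posX = tmp_pos.posX,
            posY = tmp_pos.posY
        FROM tmp_pos
        WHERE myTable.id = tmp_pos.id
    """)
    conn.execute("DROP TABLE tmp_pos")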
One way: SET x=CASE..END (any SQL)
Yes, you can do this, but I doubt that it would improve performance, unless your query has really large latency.
If the search column is indexed (e.g. if id is the primary key), then locating the desired tuple is very, very fast, and after the first query the table will be held in memory.
So, multiple UPDATEs in this case aren't all that bad.
If, on the other hand, the condition requires a full table scan, and even worse, the table's memory impact is significant, then having a single complex query will be better, even if evaluating the UPDATE is more expensive than a simple UPDATE (which gets internally optimized).
In this latter case, you could do:
UPDATE table SET posX=CASE
WHEN id=id[1] THEN posX[1]
WHEN id=id[2] THEN posX[2]
...
ELSE posX END [, posY = CASE ... END]
WHERE id IN (id[1], id[2], id[3]...);
The total cost is given more or less by: NUM_QUERIES * ( COST_QUERY_SETUP + COST_QUERY_PERFORMANCE ). This way, you knock down on NUM_QUERIES (from N separate id's to 1), but COST_QUERY_PERFORMANCE goes up (about 3x in MySQL 5.28; haven't yet tested in MySQL 8).
Otherwise, I'd try with indexing on id, or modifying the architecture.
This is an example with PHP, where I suppose we have a condition that already requires a full table scan, and which I can use as a key:
// Multiple update rules
$updates = [
    "fldA='01' AND fldB='X'" => [ 'fldC' => 12, 'fldD' => 15 ],
    "fldA='02' AND fldB='X'" => [ 'fldC' => 60, 'fldD' => 15 ],
    ...
];
The fields updated in the right-hand expressions can be one or many, but they must always be the same (always fldC and fldD in this case). This restriction can be removed, but it would require a modified algorithm.
I can then build the single query through a loop:
$where = [ ];
$set = [ ];
foreach ($updates as $when => $then) {
    $where[] = "({$when})";
    foreach ($then as $fld => $value) {
        if (!array_key_exists($fld, $set)) {
            $set[$fld] = [ ];
        }
        $set[$fld][] = $value;
    }
}
$set1 = [ ];
foreach ($set as $fld => $values) {
    $set2 = "{$fld} = CASE";
    foreach ($values as $i => $value) {
        $set2 .= " WHEN {$where[$i]} THEN {$value}";
    }
    $set2 .= ' END';
    $set1[] = $set2;
}

// Single query
$sql = 'UPDATE table SET '
     . implode(', ', $set1)
     . ' WHERE '
     . implode(' OR ', $where);
Another way: ON DUPLICATE KEY UPDATE (MySQL)
In MySQL I think you could do this more easily with a multi-row INSERT ... ON DUPLICATE KEY UPDATE, assuming that id is a primary key. Keep in mind that nonexistent ids ("id = 777" with no 777) will get inserted into the table, and may cause an error if, for example, other required columns (declared NOT NULL) aren't specified in the query:
INSERT INTO tbl (id, posx, posy, bazinga)
VALUES (id1, posX1, posY1, 'DELETE'),
       ...
ON DUPLICATE KEY UPDATE posx=VALUES(posx), posy=VALUES(posy);
DELETE FROM tbl WHERE bazinga='DELETE';
The 'bazinga' trick above allows you to delete any rows that might have been unwittingly inserted because their id was not present (in other scenarios you might want the inserted rows to stay, though).
For example, a periodic update from a set of gathered sensors, but some sensors might not have been transmitted:
INSERT INTO monitor (id, value)
VALUES (sensor1, value1), (sensor2, 'N/A'), ...
ON DUPLICATE KEY UPDATE value=VALUES(value), reading=NOW();
(This is a contrived case, it would probably be more reasonable to LOCK the table, UPDATE all sensors to N/A and NOW(), then proceed with INSERTing only those values we do have).
A third way: CTE (PostgreSQL, not sure about SQLite3)
This is conceptually almost the same as the INSERT MySQL trick. As written, it works in PostgreSQL 9.6:
WITH updated(id, posX, posY) AS (VALUES
(id1, posX1, posY1),
(id2, posX2, posY2),
...
)
UPDATE myTable
SET
posX = updated.posX,
posY = updated.posY
FROM updated
WHERE (myTable.id = updated.id);
Something like this might work for you:
"UPDATE myTable SET ... ;
UPDATE myTable SET ... ;
UPDATE myTable SET ... ;
UPDATE myTable SET ... ;"
If any of the posX or posY values are the same, then they could be combined into one query
UPDATE myTable SET posX='39' WHERE id IN('2','3','40');
In recent versions of SQLite (beginning with 3.24.0, released in 2018) you can use the UPSERT clause. Assuming id is a unique (or primary key) column and only existing rows are being touched, you can use this approach, which is similar to @LSerni's ON DUPLICATE suggestion:
INSERT INTO myTable (id, posX, posY) VALUES
( 1, 35, 565),
( 3, 89, 224),
( 6, 11, 456),
(14, 87, 475)
ON CONFLICT (id) DO UPDATE SET
posX = excluded.posX, posY = excluded.posY
I could not actually make #Clockwork-Muse's version work, but I could make this variation work:
WITH Tmp AS (SELECT * FROM (VALUES (id1, newsPosX1, newPosY1),
(id2, newsPosX2, newPosY2),
......................... ,
(idN, newsPosXN, newPosYN)) d(id, px, py))
UPDATE t
SET posX = (SELECT px FROM Tmp WHERE t.id = Tmp.id),
posY = (SELECT py FROM Tmp WHERE t.id = Tmp.id)
FROM TableToUpdate t
I hope this works for you too!
Use a comma ","
eg:
UPDATE my_table SET rowOneValue = rowOneValue + 1, rowTwoValue = rowTwoValue + ( (rowTwoValue / (rowTwoValue) ) + ?) * (v + 1) WHERE value = ?
To update a table with different values for a column1, given values on column2, one can do as follows for SQLite:
"UPDATE table SET column1=CASE WHEN column2<>'something' THEN 'val1' ELSE 'val2' END"
Try with "update tablet set (row='value' where id=0001'), (row='value2' where id=0002'), ...