pypyodbc SQL query of cloud-stored MS Access database is slow when querying newest data but fast when querying oldest data

I'm using pypyodbc and pandas.read_sql_query to query a cloud-stored MS Access database (.accdb file).
import pypyodbc
import pandas as pd
from datetime import datetime

def query_data(group_id, dbname=r'\\cloudservername\myfile.accdb', table_names=['ContainerData']):
    start_time = datetime.now()
    print(start_time)
    pypyodbc.lowercase = False
    conn = pypyodbc.connect(
        r"Driver={Microsoft Access Driver (*.mdb, *.accdb)};" +
        r"DBQ=" + dbname + r";")
    connection_time = datetime.now() - start_time
    print("Connection Time: " + str(connection_time))
    querystring = ("SELECT TOP 10 Column1, Column2, Column3, Column4 FROM " +
                   table_names[0] + " WHERE Column0 = " + group_id)
    my_data = pd.read_sql_query(querystring, conn)
    print("Query Time: " + str(datetime.now() - start_time - connection_time))
    conn.close()
    return my_data
The database has about 30,000 rows. The group_id values are sequential numbers from 1 to 3000, with 10 rows assigned to each group. For example, rows 1-10 in the database (the oldest data) all have group_id = 1, and rows 29,991-30,000 (the newest data) all have group_id = 3000.
When I store the database locally on my PC and run query_data('1') the connection time is 0.1s and the query time is 0.01s. Similarly, running query_data('3000') the connection time is 0.2s and the query time is 0.08s.
When the database is stored on the cloud server, the connection time varies from 20-60 seconds. When I run query_data('1'), the query time is ~3 seconds. NOW THE BIG ISSUE: when I run query_data('3000'), the query time is ~10 minutes!
I've tried using ORDER BY group_id DESC, but that causes both queries to take ~10 minutes.
I've also tried changing the "Order By" on group_id to Descending in the .accdb itself and setting "Order By On Load" to Yes. Neither of these seems to change how the SQL query locates the data.
The problem is, the code I'm using almost always needs to find the newest data (i.e. the maximum group_id), which takes the longest to find. Is there a way to have the SQL query reverse its search order, so that the newest entries are looked through first rather than the oldest ones? I wouldn't mind a 3-second (or even 1-minute) query time, but a 10-minute query time is too long. Or is there a setting I can change in the Access database to change the order in which the data is stored?
I've also watched the network monitor while running the script, and python.exe steadily sends about 2kb/s and receives about 25kb/s throughout the full 10 minute duration of the script.
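Would adding an index on the filter column help? Here's a rough Access-SQL sketch of what I mean (the index name idxGroupId is made up; Column0 and ContainerData are taken from the code above):

CREATE INDEX idxGroupId ON ContainerData (Column0);

SELECT TOP 10 Column1, Column2, Column3, Column4
FROM ContainerData
WHERE Column0 = 3000;

My understanding is that an index on Column0 would let the engine seek straight to the ten matching rows instead of scanning forward from the oldest data, so the TOP 10 query itself could stay as written.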

Related

How to improve the efficiency of the query below in SQL Server?

I have a database on the order of ten million rows. The client needs to read the data and perform calculations.
Due to the large amount of data, if it is all kept in the application cache, memory will overflow and the application will crash.
If I use a SELECT statement to query data from the database in real time, it may take too long and the database may be hit too frequently.
Is there a better way to read the database data? I use C++ and C# to access the SQL Server database.
My database statement is similar to the following:
SELECT TOP 10 y.SourceName, MAX(y.EndTimeStamp - y.StartTimeStamp) AS ProcessTimeStamp
FROM
(
SELECT x.SourceName, x.StartTimeStamp, IIF(x.EndTimeStamp IS NOT NULL, x.EndTimeStamp, 134165256277210658) AS EndTimeStamp
FROM
(
SELECT
SourceName,
Active,
LEAD(Active) OVER(PARTITION BY SourceName ORDER BY TicksTimeStamp) NextActive,
TicksTimeStamp AS StartTimeStamp,
LEAD(TicksTimeStamp) OVER(PARTITION BY SourceName ORDER BY TicksTimeStamp) EndTimeStamp
FROM Table1
WHERE Path = N'App1' and TicksTimeStamp >= 132165256277210658 and TicksTimeStamp < 134165256277210658
) x
WHERE (x.Active = 1 and x.NextActive = 0) OR (x.Active = 1 and x.NextActive = null)
) y
GROUP BY y.SourceName
ORDER BY ProcessTimeStamp DESC, y.SourceName
The database structure is roughly as follows:
ID  Path  SourceName  TicksTimeStamp      Active
1   App1  Pipe1       132165256277210658  1
2   App1  Pipe1       132165256297210658  0
3   App1  Pipe1       132165956277210658  1
4   App2  Pipe2       132165956277210658  1
5   App2  Pipe2       132165956277210658  0
I use ExecuteReader in C#. The same SQL statement runs in SQL Server Management Studio in 4 s, but ExecuteReader takes 8-9 s to return. Does the slower time have anything to do with this interface?
I don't really 'get' the entire query but I'm wondering about this part:
WHERE (x.Active = 1 and x.NextActive = 0) OR (x.Active = 1 and x.NextActive = null)
SQL doesn't really like ORs, so why not convert this to
WHERE x.Active = 1 and ISNULL(x.NextActive, 0) = 0
This might cause a completely different query plan. (or not)
As CharlieFace mentioned, probably best to share the query plan so we might get an idea of what's going on.
PS: I'm also not sure what those 'ticksTimestamps' represent, but it looks like you're fetching a pretty wide range there; bigger volumes will also mean longer processing times. Even though you only return the top 10, it still has to go through the entire range to calculate those durations.
I agree with @Charlieface. I think the index you want is as follows:
CREATE INDEX idx ON Table1 (Path, TicksTimeStamp) INCLUDE (SourceName, Active);
You can add both indexes (with different names of course) and see which one the execution engine chooses.
I can suggest adding the following index which should help the inner query using LEAD:
CREATE INDEX idx ON Table1 (SourceName, TicksTimeStamp, Path) INCLUDE (Active);
The key point of the above index is that it should allow the lead values to be rapidly computed. It also has an INCLUDE clause for Active, to cover the entire select.

RAM overflow and long loading times, SQL query on big data

I have an existing database from which I need to extract a single result set that contains a total of 10 GB of data. I have tried to load the data with
conn = sqlite(databaseFile, 'readonly')
GetResult = [
'SELECT result1, result2 ...... FROM Result '...
'WHERE ResultID IN ......'
];
Data = fetch(conn, GetResult)
With this query, the working memory fills up (16 GB) until it is full, and then the software crashes.
I also tried limiting the result with
'LIMIT 10000'
at the end of the query and paging through the results by offset. This works, but it takes about 3 hours (extrapolated from 20 individual batches) to get all the results. (The database cannot be changed.)
Maybe someone has an idea for getting the data faster, or in a single query.
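To make the paging concrete, this is roughly the shape of what I ran, plus the "keyset" variant I'm wondering about as an alternative (plain SQLite SQL; result1, result2, Result and ResultID come from the query above, :last_seen_result_id is a made-up parameter name, and the original ResultID IN (...) filter is left out for brevity):

-- Offset paging: each batch re-reads and throws away all the rows skipped
-- by OFFSET, so later batches get slower and slower.
SELECT result1, result2
FROM Result
ORDER BY ResultID
LIMIT 10000 OFFSET 20000;

-- Keyset paging: remember the last ResultID of the previous batch and seek
-- past it; with an index on ResultID each batch costs about the same.
SELECT result1, result2
FROM Result
WHERE ResultID > :last_seen_result_id
ORDER BY ResultID
LIMIT 10000;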

Calculation based on values in 2 different rows

I have a table in MS Access which has stock prices arranged like
Ticker1, 9:30:00, $49.01
Ticker1, 9:30:01, $49.08
Ticker2, 9:30:00, $102.02
Ticker2, 9:30:01, $102.15
and so on.
I need to do some calculations where I compare the price in one row with the immediately previous price (and if the price movement is greater than X% in one second, I need to report the instance separately).
If I were doing this in Excel, it would be a fairly simple formula, but I have a few million rows of data, so that's not an option.
Any suggestions on how I could do it in MS Access?
I am open to any kind of solutions (with or without SQL or VBA).
Update:
I ended up trying to traverse my records using ADODB.Recordset in nested loops; code below. I thought it was a good idea, and the logic worked for a small table (20k rows). But when I ran it on a larger table (3m rows), Access ballooned to its 2 GB limit without finishing the task (because of temporary tables; the size of the original table was more like ~300 MB). Posting it here in case it helps someone with smaller data sets.
Do While Not rstTickers.EOF
    myTicker = rstTickers!ticker
    rstDates.MoveFirst
    Do While Not rstDates.EOF
        myDate = rstDates!Date_Only
        strSql = "select * from Prices where ticker = """ & myTicker & """ and Date_Only = #" & myDate & "#" 'get all prices for a given ticker for a given date
        rst.Open strSql, cn, adOpenKeyset, adLockOptimistic 'I needed to do this to open in editable mode
        rst.MoveFirst
        sPrice1 = rst!Open_Price
        rst!Row_Num = i
        rst.MoveNext
        Do While Not rst.EOF
            i = i + 1
            rst!Row_Num = i
            rst!Previous_Price = sPrice1
            sPrice2 = rst!Open_Price
            rst!Price_Move = Round(Abs((sPrice2 / sPrice1) - 1), 6)
            sPrice1 = sPrice2
            rst.MoveNext
        Loop
        i = i + 1
        rst.Close
        rstDates.MoveNext
    Loop
    rstTickers.MoveNext
Loop
If the data is always one second apart without any milliseconds, then you can join the table to itself on the Ticker ID and the time offsetting by one second.
Otherwise, if there is no sequence counter of some sort to join on, then you will need to create one. You can do this with a "ranking" query. There are multiple approaches to this; you can try each and see which one works fastest in your situation.
One approach is to use a subquery that returns the number of rows that come before the current row. Another approach is to join the table to itself on all the rows before it and do a group-by and count. Both approaches produce the same results, but depending on the nature of your data, how it's structured, and what indexes you have, one approach will be faster than the other.
Once you have a "rank" column, you do the procedure described in the first paragraph, but instead of joining on an offset of time, you join on an offset of rank, as in the sketch below.
I ended up moving my data to SQL Server (which had its own issues). I added a row number column (Row_Num) like this:
ALTER TABLE Prices ADD Row_Num INT NOT NULL IDENTITY (1,1)
It worked for me (I think) because my underlying data was already in the order I needed it to be in. I've read enough comments saying you shouldn't rely on this, because you don't know what order the server stores the data in.
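From what I've read, the safer way would be to number the rows from an explicit ordering instead, something like this untested sketch (Row_Num2 is just a made-up name so it doesn't clash with the IDENTITY column above; Ticker and Time_Date are the columns used elsewhere in this thread):

-- Add a plain (non-IDENTITY) column, then number the rows by an explicit
-- ordering so the result does not depend on insertion order.
ALTER TABLE Prices ADD Row_Num2 INT NULL;
GO

WITH Ordered AS (
    SELECT Row_Num2,
           ROW_NUMBER() OVER (ORDER BY Ticker, Time_Date) AS rn
    FROM Prices
)
UPDATE Ordered
SET Row_Num2 = rn;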
Anyway, after that it was a join of the table on itself. It took me a while to figure out the syntax (I am new to SQL). Adding the SQL here for reference (works on SQL Server but not Access).
Update A Set Previous_Price = B.Open_Price
FROM Prices A INNER JOIN Prices B
ON A.Date_Only = B.Date_Only
WHERE ((A.Ticker=B.Ticker) AND (A.Row_Num=B.Row_Num+1));
BTW, I had to first add the Date_Only column like this (works in Access but not SQL Server):
UPDATE Prices SET Prices.Date_Only = Format([Time_Date],"mm/dd/yyyy");
I think the solution for row numbers described by @Rabbit should work better (broadly speaking); I just haven't had the time to try it out. It took me a whole day to get this far.

SQL Select/From/Where Run Speed

I have a program that pulls data from a Visual FoxPro table and dumps it into a DataSet with VB.NET. My connection string works great, and the query I'm using usually runs with respectable speed. As I've run it more, however, I've learned that there is a large amount of "bad" data in my table. So now I'm trying to refine my query to guard against the "bad" data, but what I thought would be a very small tweak has yielded massive performance losses, and I'm not particularly sure why.
My original query is:
'Pull desired columns for orders that have not "shipped" and were received in past 60 days.
'To "ship", an order must qualify with both an updated ship date and Sales Order #.
sqlSelect = "SELECT job_id,cust_id,total_sale,received,due,end_qty,job_descr,shipped,so "
sqlFrom = "FROM job "
sqlWhere = "WHERE fac = 'North Side' AND shipped < {12/30/1899} AND so = '' AND received >= DATE()-60;"
sql = sqlSelect & sqlFrom & sqlWhere
This has a run-time of about 20 seconds; while I'd prefer it to be quicker, it's not a problem. In my original testing (and occasional debugging), I replaced sqlWhere with sqlWhere = "WHERE job_id = 127350". This runs pretty much instantaneously.
Now the problem block: Once I replaced sqlWhere with
'Find jobs that haven't "shipped" OR were received within last 21 days.
'Recently shipped items are desired in results.
sqlWhere = "WHERE fac = 'North Side' AND ((shipped < {12/30/1899} AND so = '') OR received >= DATE()-21);"
My performance jumped to about 3 min 40 sec. This time is almost exactly the same as the time to run with sqlWhere = "WHERE received >= DATE();".
I'm not the moderator of these tables; I'm merely pulling from them to create a series of reports for our users. My best guess is that the received field is not indexed and that this is the cause of my performance drop-off. But while my first search returns about 100 records, pulling only the jobs from today returns about 5, and it still takes about 11x as long.
So my question is three part:
1) Would someone be able to explain the phenomenon I'm experiencing right now? I feel like I'm somewhat on the right track, but my knowledge of SQL has been limited to circumstantial use within other languages...
2) Is there something I'm missing, or some better way to obtain the results I need? There is a large volume of records that haven't "shipped", simply because the user only input a shipped date or s/o and didn't do the other. I need a way to view very recent orders (regardless of "shipped" status), and then also view less recent orders that have "bad" data, so I can get the users in the habit of cleaning it up.
3) Is it bad SQL practice to overconstrain a WHERE clause? If I run fifteen field comparisons, joined together with nested ANDs/ORs, am I wasting my time when I could be doing something much cleaner?
Many thanks,
B
If you are filtering on a non-indexed column in your WHERE clause, the SQL engine must do a table scan, i.e. look at every record in the table.
The difference between the two queries is the OR instead of the AND. When you have a non-indexed column in an AND, the SQL engine can use the indexes to narrow down the number of records it has to check for the non-indexed column. When you have an OR, it must look at every record in the table and compare on that column.
Adding an index on the Received column would probably fix the performance issue.
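If adding the index isn't an option (you mention not being the moderator of those tables), one common workaround is to split the OR into two index-friendly queries and combine them with UNION, which also removes duplicate rows. Here's an untested sketch using the columns from the question:

SELECT job_id, cust_id, total_sale, received, due, end_qty, job_descr, shipped, so
FROM job
WHERE fac = 'North Side' AND shipped < {12/30/1899} AND so = ''
UNION
SELECT job_id, cust_id, total_sale, received, due, end_qty, job_descr, shipped, so
FROM job
WHERE fac = 'North Side' AND received >= DATE() - 21;

Whether that actually beats the OR depends on which indexes exist, so it's worth timing both.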
In general, there are two things you don't want to have happen in your WHERE clause.
1. A primary condition on a non-indexed column
2. Using a calculation on a column. For example, doing WHERE Shipped-2 < date() is often worse than doing Shipped < Date() + 2, because the former doesn't typically allow the index to be used.
Refining your query through multiple WHERE clauses is generally a good thing. The fewer records you need to return to your application the better your performance will be, but you need to have appropriate indexing in place.

Multiple COUNT expressions based on different criteria in the same query

I am trying to create a summary based on data from one table in Access, but I'm having some unexpected issues which I hope someone can resolve.
Table 1 looks like this
Region || Case ID || Tasked || Visited
For each region I would like to show three fields:
a Total column (count of Case IDs),
a Total Tasked count (where Tasked = Yes), and
a Total Visited count (where Visited = Yes).
Creating the Total column is fine; however, once I started adding WHERE ... = Yes clauses, I obviously lose data in the total column. Is there a way around this?
I was intrigued by E Mett's test results regarding performance so I tried to reproduce them. Unfortunately, I could not.
I ran the tests against a table with 1 million rows residing in a back-end .accdb file on a network share. I ran three tests (re-loading the front-end .accdb each time) and averaged the results.
SELECT
COUNT(*) AS TotalRows,
SUM(IIf(Tasked=True,1,0)) AS TaskedRows
FROM TestData
Test runs: 24.8, 24.0, 23.8 seconds
Average: 24.2 seconds
SELECT
COUNT(*) AS TotalRows,
SUM(Abs(Tasked)) AS TaskedRows
FROM TestData
Test runs: 22.3, 23.8, 24.9 seconds
Average: 23.7 seconds
Based on those results SUM(Abs()) might be very slightly faster than SUM(IIf()), but certainly not 12x faster.
If speed is an issue and you had the foresight to put an index on the [Tasked] field, then a truly faster approach would be
SELECT
DCount("*", "TestData") AS TotalRows,
DCount("*", "TestData", "Tasked=True") AS TaskedRows
Test runs: 2.1, 3.5, 2.3 seconds
Average: 2.6 seconds
As always, query performance tuning can be an interesting game in itself.
Use the following:
SUM(ABS(Tasked)) AS TotalTasked
The ABS function will convert the -1 to 1
Abs is about 12 times faster than IIf! If you have thousands of records, it may make a difference.
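Putting it together, the whole summary can then be a single grouped query, sketched here assuming the table is called Table1 and that Tasked and Visited are Yes/No fields as described in the question (Access stores Yes as -1, which is why Abs turns it into 1):

SELECT Region,
       COUNT(*) AS TotalCases,
       SUM(Abs(Tasked)) AS TotalTasked,
       SUM(Abs(Visited)) AS TotalVisited
FROM Table1
GROUP BY Region;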