I have an existing database from which I need to extract a single result set totaling about 10 GB of data. I have tried to load the data with
conn = sqlite(databaseFile, 'readonly');
GetResult = [ ...
    'SELECT result1, result2 ...... FROM Result ' ...
    'WHERE ResultID IN ......'];
Data = fetch(conn, GetResult);
With this query, working memory usage grows until all 16 GB are full, and then the software crashes.
I also tried to limit the result with
'LIMIT 10000'
at the end of the query and page through the results by offset. This works, but it takes 3 hours (extrapolated from 20 individual batches) to get all the results. (The database cannot be changed.)
Maybe one of you has an idea of how to get the data faster, or in one query.
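For what it's worth, a minimal sketch of one approach, keyset pagination, written here with Python's sqlite3 rather than the MATLAB Database Toolbox (an illustration, not the original setup: database.db is a placeholder, the table is assumed to have a rowid, and the WHERE ResultID IN filter would be ANDed onto the rowid condition). Seeking past the last-seen rowid avoids the rescan that makes LIMIT/OFFSET paging slow, and fetching in chunks keeps memory bounded:

import sqlite3

conn = sqlite3.connect("file:database.db?mode=ro", uri=True)  # placeholder path
cur = conn.cursor()
last_rowid = 0
while True:
    chunk = cur.execute(
        "SELECT rowid, result1, result2 FROM Result "
        "WHERE rowid > ? ORDER BY rowid LIMIT 10000",
        (last_rowid,),
    ).fetchall()
    if not chunk:
        break
    last_rowid = chunk[-1][0]
    # process this chunk, then drop it so memory stays flat
conn.close()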
I'm using pypyodbc and pandas.read_sql_query to query a cloud-stored MS Access database (.accdb file).
from datetime import datetime

import pandas as pd
import pypyodbc

def query_data(group_id, dbname=r'\\cloudservername\myfile.accdb', table_names=['ContainerData']):
    start_time = datetime.now()
    print(start_time)
    pypyodbc.lowercase = False
    conn = pypyodbc.connect(
        r"Driver={Microsoft Access Driver (*.mdb, *.accdb)};" +
        r"DBQ=" + dbname + r";")
    connection_time = datetime.now() - start_time
    print("Connection Time: " + str(connection_time))
    querystring = ("SELECT TOP 10 Column1, Column2, Column3, Column4 FROM " +
                   table_names[0] + " WHERE Column0 = " + group_id)
    my_data = pd.read_sql_query(querystring, conn)
    print("Query Time: " + str(datetime.now() - start_time - connection_time))
    conn.close()
    return my_data
The database has about 30,000 rows. The group_id values are sequential numbers from 1 to 3000, with 10 rows assigned to each group. For example, rows 1-10 in the database (oldest data) all have group_id = 1, and rows 29,991-30,000 (newest data) all have group_id = 3000.
When I store the database locally on my PC and run query_data('1') the connection time is 0.1s and the query time is 0.01s. Similarly, running query_data('3000') the connection time is 0.2s and the query time is 0.08s.
When the database is stored on the cloud server, the connection time varies from 20-60 s. When I run query_data('1') the query time is ~3 seconds. NOW THE BIG ISSUE: when I run query_data('3000') the query time is ~10 minutes!
I've tried using ORDER BY group_id DESC but that causes both queries to take ~ 10 minutes.
I've also tried changing the "Order By" on group_id to Descending in the .accdb itself and setting "Order By On Load" to Yes. Neither of these seems to change how the SQL query locates the data.
The problem is, the code I'm using almost always needs to find the newest data (i.e. the maximum group_id), which takes the longest to find. Is there a way to have the SQL query reverse its search order, so that the newest entries are scanned first rather than the oldest? I wouldn't mind a 3-second (or even 1-minute) query time, but a 10-minute query time is too long. Or is there a setting I can change in the Access database to change the order in which the data is stored?
I've also watched the network monitor while running the script: python.exe steadily sends about 2 KB/s and receives about 25 KB/s throughout the full 10-minute duration of the script.
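One hedged idea, assuming the .accdb itself may be modified and that Column0 is the group_id column: without an index on Column0, the ACE engine has to read through the table file over the share to find matching rows, and the group_id = 3000 rows sit at the very end of the file. A one-time CREATE INDEX would let it seek directly to the matching pages instead:

import pypyodbc

conn = pypyodbc.connect(
    r"Driver={Microsoft Access Driver (*.mdb, *.accdb)};"
    r"DBQ=\\cloudservername\myfile.accdb;")
cur = conn.cursor()
cur.execute("CREATE INDEX idx_group ON ContainerData (Column0)")  # one-time setup
conn.commit()
conn.close()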
I have the following query in SQL Server:
select * from sales
This returns about a billion rows, and I need to process the data. I would like to show the progress while the data is being processed using something like the following:
res = conn.execute('select * from sales s1')
total_rows = ?  # <-- this is the part I'm missing
processed_rows = 0
step_size = 10000
while True:
    data = res.fetchmany(step_size)
    if not data:
        break
    processed_rows += len(data)
    print("Progress: %s / %s" % (processed_rows, total_rows))
Is there a way to get the total number of rows in a SQL query without running another query (or doing an operation such as len(res.fetchall()), which will add a lot of overhead to the above)?
Note: I'm not interested in paginating here (which is what should be done). This is more a question to see if it's possible to get the TOTAL ROW COUNT in a query in SQL Server BEFORE paginating/processing the data.
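One single-query option, sketched below under the assumption of a pyodbc connection (the DSN is hypothetical): the window aggregate COUNT(*) OVER () attaches the total row count to every row, so the total is known as soon as the first batch arrives. Note it is not free; SQL Server still has to determine the full count before the first rows stream back:

import pyodbc

conn = pyodbc.connect("DSN=mydsn")  # hypothetical DSN
res = conn.execute("SELECT COUNT(*) OVER () AS total_rows, s1.* FROM sales s1")
step_size = 10000
processed_rows = 0
total_rows = None
while True:
    data = res.fetchmany(step_size)
    if not data:
        break
    if total_rows is None:
        total_rows = data[0].total_rows  # same value on every row
    processed_rows += len(data)
    print("Progress: %s / %s" % (processed_rows, total_rows))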
I have a table with two integer fields, x and y, and a few million rows.
The fields are created with the following code:
Field.newBuilder("x", LegacySQLTypeName.INTEGER).setMode(Field.Mode.NULLABLE).build();
If I run the following from the web UI:
SELECT x,y FROM [myproject:Test.Test] where x=1 LIMIT 50
Query Editor: "Valid: This query will process 64.9 MB when run."
compared to:
SELECT x FROM [myproject:Test.Test] where x=1 LIMIT 50
Query Editor: "Valid: This query will process 32.4 MB when run."
Selecting the second field more than doubles the amount of data scanned.
I would expect it to first find the relevant rows based on the WHERE clause and then fetch the extra field without scanning the entire second column.
Any input on why it doubles the data scanned, and how to avoid it, would be appreciated.
In my application I have hundreds of possible fields which I need to fetch for a very small number of rows (50) that answer the query.
Does this mean I will need to process all the fields' data?
* I'm aware of how a columnar database works, but I wasn't aware of the huge price of bringing in lots of fields based on a very selective WHERE clause.
The following link provides a very clear answer:
best-practices-performance-input
BigQuery has no concept of indexes or anything like that. When you query a column, BigQuery scans through all the values of that column and then performs the operations you want (for a deeper understanding, they have some pretty cool posts about the inner workings of BQ).
That means that when you select x and y where x = 1, BQ will read through all values of x and y and then find where x = 1.
This ends up being an amazing feature of BQ: you just load your data there and it just works. But it does force you to be aware of how much data each query retrieves. Queries of the form select * from table should be used only if you really need all columns.
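A small sketch of how to verify this pricing behavior without paying for the scan, using the google-cloud-bigquery Python client (standard SQL table syntax here, versus the legacy [myproject:Test.Test] form above; credentials and project setup are assumed). A dry run reports the bytes a query would process, and adding a column to the SELECT grows the estimate while the WHERE clause does not shrink it:

from google.cloud import bigquery

client = bigquery.Client()  # assumes credentials and project are configured
cfg = bigquery.QueryJobConfig(dry_run=True, use_query_cache=False)
for cols in ("x", "x, y"):
    job = client.query(
        "SELECT %s FROM `myproject.Test.Test` WHERE x = 1 LIMIT 50" % cols,
        job_config=cfg)
    print("%s -> %d bytes" % (cols, job.total_bytes_processed))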
How do we match columns based on a condition of closeness to a value?
This requires a complex query with range comparisons and multiple join conditions.
I am getting a "Query size exceeded 2 GB" error.
Tables:
InvDetails1 / InvDetails2 / INVDL / ExpectedResult
Field relations:
InvDetails1.F1 = InvDetails2.F3
InvDetails2.F5 = INVDL.F1
INVDL.DLID = ExpectedResult.DLID
ExpectedResult.Total - 1 < InvDetails1.F6 < ExpectedResult.Total + 1
left(InvDetails1.F21,10) = '2013-03-07'
Return results where the number of matching records from ExpectedResult is exactly 1. Grouping by InvDetails1.F1 and counting ExpectedResult.DLID works for that part.
From this result, the final output should be:
InvDetails1.F1, InvDetails1.F16, ExpectedResult.DLID, ExpectedResult.NMR
ExpectedResult has millions of rows; the InvDetails tables have a few hundred thousand.
If I were in that situation and found that my query was "hitting a wall" at 2 GB, one thing I would try would be to create a separate saved Select query to isolate the InvDetails1 records for just the specific date in question. Then I would use that query, instead of the full InvDetails1 table, when joining to the other tables.
My reasoning is that the query optimizer may not be able to use your left(InvDetails1.F21,10) = '2013-03-07' condition to exclude InvDetails1 records early in the execution plan, possibly causing the query to grow much larger than it really needs to (internally, while it is being processed). Forcing the date selection to the beginning of the process by putting it in a separate (prerequisite) query may keep the size of the "main" query down to a more feasible size.
Also, if I found myself in the situation where my queries were getting that big I would also keep a watchful eye on the size of my .accdb (or .mdb) file to ensure that it does not get too close to 2GB. I've never had it happen myself, but I've heard that database files that hit the 2GB barrier can result in some nasty errors and be rather "challenging" to recover.
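As a sketch of that two-step idea (qryInvDetails1Day is a hypothetical name for the saved query, and the range test assumes F21 is text in ISO 'YYYY-MM-DD...' form so the engine can exclude rows early, possibly via an index on F21):

# The sargable range test below replaces left(F21,10) = '2013-03-07' so
# rows can be excluded early; save this SELECT in Access as a named query,
# e.g. qryInvDetails1Day (hypothetical name).
prefilter_sql = """
    SELECT * FROM InvDetails1
    WHERE F21 >= '2013-03-07' AND F21 < '2013-03-08'
"""

# Then join against the saved query instead of the full table (Access
# requires the nested parentheses around multiple joins):
main_sql = """
    SELECT d1.F1, d1.F16, er.DLID, er.NMR
    FROM ((qryInvDetails1Day AS d1
        INNER JOIN InvDetails2 AS d2 ON d1.F1 = d2.F3)
        INNER JOIN INVDL ON d2.F5 = INVDL.F1)
        INNER JOIN ExpectedResult AS er ON INVDL.DLID = er.DLID
    WHERE d1.F6 > er.Total - 1 AND d1.F6 < er.Total + 1
"""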
Can someone explain the difference between these 2 simple queries ...
SET ROWCOUNT = 10
select * from t_Account order by account_date_last_maintenance
and this one
select * from t_Account order by account_date_last_maintenance
SET ROWCOUNT = 10
When executed, both return only 10 rows, but the rows are different. There are millions of rows in the table, if that matters. Also, the first query consistently runs 20% longer.
Thanks everyone
When you execute SET ROWCOUNT 10 you are telling SQL Server to stop the query after 10 rows have been returned. Your first statement has the correct ordering (with the exception of the first line, which should read SET ROWCOUNT 10).
The second statement as written will return all of the values ordered when initially executed and then set the row count to 10, so any subsequent execution will return the first 10 items.
ROWCOUNT must be set to 0 to get things back to "normal" execution.
As to why the results differ: the data is not necessarily processed in the same order every time, so given the size of your dataset you might sometimes get matching results, but it is not a sure thing. If you want consistent results and only want the first 10 rows, I would recommend using TOP.
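For example, a minimal sketch of the TOP version (pyodbc and the DSN are assumptions; the result is deterministic as long as account_date_last_maintenance has no ties), which avoids leaving a session-level ROWCOUNT in effect for later statements:

import pyodbc

conn = pyodbc.connect("DSN=mydsn")  # hypothetical DSN
rows = conn.execute(
    "SELECT TOP 10 * FROM t_Account "
    "ORDER BY account_date_last_maintenance"
).fetchall()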