Pymongo Query gets extremely slow - pymongo

I have a MongoDB database with a collection of some 12M documents and an index on the Data.Report.FIELD field. I am trying to fetch all of its values. If I do it with one large cursor, it just dies on the way, so I split it into slices of 100K documents.
for i in range(10):
    for a in data.find({'Data.Report.FIELD': {"$gt": i * 100000, "$lt": (i + 1) * 100000 + 1}},
                       {'Data.Report.FIELD': 1}):
        if 'FIELD' in a['Data']['Report'][0]:
            _ids.append([a['_id'], a['Data']['Report'][0]['FIELD']])
            _FIELDs.append(a['Data']['Report'][0]['FIELD'])
            good += 1
        else:
            bad += 1
    print('Done with', i, 'hundred thousand. Time:', time.time() - start, 'seconds.')
And what I get is something like:
Done with 0 hundred thousand. Time: 116.90340232849121 seconds.
Done with 1 hundred thousand. Time: 182.20432806015015 seconds.
Done with 2 hundred thousand. Time: 2561.886509180069 seconds.
Done with 3 hundred thousand. Time: 4840.841073274612 seconds.
What could be the reason for it getting so crazily slow after 200K documents? Is there anything I could change? Could it be a server issue? (A sketch of one alternative approach follows the stats below.)
UPD:
Indexes:
{'_id_': {'v': 2, 'key': [('_id', 1)], 'ns': 'admin.ECR0618'},
'FIELD': {'v': 2, 'key': [('Data.Report.FIELD', 1)], 'ns': 'admin.ECR0618',
'background': False}}
Basic stats:
'ns': 'admin.ECR0618',
'size': 176633446637.0,
'count': 11782003,
'avgObjSize': 14991,
'storageSize': 59065884672.0,
'capped': False,
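Not a diagnosis, but one alternative worth timing (a minimal sketch, assuming `data` is the same pymongo collection as above and that only Data.Report.FIELD is needed): instead of re-running ten range scans, stream the whole collection once in index order, keep the server-side cursor alive with no_cursor_timeout, and raise batch_size to cut down round trips.

# Sketch only: `data` is assumed to be the pymongo Collection from the question.
import time

import pymongo

start = time.time()
_ids, _FIELDs = [], []
good = bad = 0

cursor = data.find(
    {},                                   # no range slicing: a single pass
    {'Data.Report.FIELD': 1},             # project only the field we need
    no_cursor_timeout=True,               # avoid the cursor dying mid-scan
    batch_size=10_000,                    # fewer client/server round trips
).sort('Data.Report.FIELD', pymongo.ASCENDING)   # walk the existing index in order

try:
    for n, a in enumerate(cursor, 1):
        report = (a.get('Data', {}).get('Report') or [{}])[0]
        if 'FIELD' in report:
            _ids.append([a['_id'], report['FIELD']])
            _FIELDs.append(report['FIELD'])
            good += 1
        else:
            bad += 1
        if n % 100_000 == 0:
            print('Done with', n // 100_000, 'hundred thousand. Time:',
                  time.time() - start, 'seconds.')
finally:
    cursor.close()                        # required cleanup when no_cursor_timeout is set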

Related

Zarr slow read speed of 43.82 GB file with Xarray

I want to look-up 8760 times for a single lat/lon combo in less than a second from 43.82 GB file of wind data containing:
8760 times (every hour in a year)
721 latitudes (every 0.25° from -90.0° to 90.0°)
1440 longitudes (every 0.25° from -180.0° to 179.75°)
The best time we achieved for a single-year lookup was 16 seconds for both the u100 and v100 wind vectors (wind speed at 100 m). We want a sub-second lookup for the whole year, as this file read will need to happen on every user request in our API. (A re-chunking sketch follows the output below.)
import time

import xarray as xr

if __name__ == '__main__':
    start_time = time.time()
    ds = xr.open_dataset("2021.zarr", engine="zarr", chunks={"time": 50})
    print(f"Took {round((time.time() - start_time) * 1000, 2)}ms")
    location = ds.sel(indexers={"latitude": 53.494, "longitude": 9.979}, method='nearest')
    wind_speed = (location.u100.values ** 2 + location.v100.values ** 2) ** 0.5
    print(f"Wind Speed: {wind_speed} m/s")
    print(f"Took {round((time.time() - start_time) * 1000, 2)}ms")
Output:
Took 94.28ms
Wind Speed: [5.8021994 5.504477 5.4270387 ... 9.563195 8.701231 9.133655 ] m/s
Took 16299.59ms
I would be very thankful for any help!
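Not something I can benchmark against this data, but the usual lever here is the chunk layout. A sketch (assuming the dims are time/latitude/longitude and the variables are u100/v100 as above; "2021_rechunked.zarr" is a made-up path): rewrite the store once so each chunk holds the full year for a small lat/lon tile, so a per-point, whole-year lookup reads one chunk per variable instead of hundreds.

# Sketch only: one-off re-chunking pass, assuming dims (time, latitude, longitude)
# and variables u100/v100 as in the question. "2021_rechunked.zarr" is a made-up path.
import xarray as xr

ds = xr.open_dataset("2021.zarr", engine="zarr")

# One chunk = all 8760 hours for a small 8x8 spatial tile.
ds = ds.chunk({"time": -1, "latitude": 8, "longitude": 8})

# Drop the stale on-disk chunk encoding so to_zarr uses the new chunking.
for var in ds.variables.values():
    var.encoding.pop("chunks", None)

ds.to_zarr("2021_rechunked.zarr", mode="w")

# At request time: open the rechunked store and read one point for the whole year.
ds2 = xr.open_dataset("2021_rechunked.zarr", engine="zarr")
point = ds2.sel(latitude=53.494, longitude=9.979, method="nearest")
wind_speed = (point.u100.values ** 2 + point.v100.values ** 2) ** 0.5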

Pandas to_sql() performance related to number of columns

I noticed some odd behaviour in a script of mine which uses pandas' to_sql function to insert large numbers of rows into one of my MSSQL servers.
The performance dramatically decreases when the number of columns exceeds 10
For example:
34484 rows x 10 columns => ~10k records per second
34484 rows x 12 columns => ~500 records per second
I use the fast_executemany flag when establishing the connection. Anyone got any idea?! (Two things worth trying are sketched below, after the code.)
engine = sqlalchemy.create_engine("mssql+pyodbc:///?odbc_connect=%s?charset=utf8" % params, fast_executemany=True)
sqlalchemy_connection = engine.connect()
....
df.to_sql(name='TEST', con=sqlalchemy_connection , if_exists='append', index=False)
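I can't reproduce this, but two cheap knobs are often suggested for fast_executemany slowdowns; a sketch (with `params` and `df` as in the question, and "text_col" / the 255 length as placeholders): cap the rows bound per executemany call with chunksize, and, if to_sql is creating the table, give string columns an explicit length via dtype so they don't end up as NVARCHAR(max).

# Sketch only: "text_col" and the 255 length are placeholders; adjust to your table.
import sqlalchemy
from sqlalchemy.types import NVARCHAR

# engine built as in the question
engine = sqlalchemy.create_engine(
    "mssql+pyodbc:///?odbc_connect=%s?charset=utf8" % params,
    fast_executemany=True,
)

df.to_sql(
    name="TEST",
    con=engine,
    if_exists="append",
    index=False,
    chunksize=1000,                      # cap rows bound per executemany call
    dtype={"text_col": NVARCHAR(255)},   # explicit length instead of NVARCHAR(max)
)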

Apache Hbase - Fetching large row is extremely slow

I'm running an Apache HBase cluster on AWS EMR. I have a table with a single column family, 75,000 columns, and 50,000 rows. I'm trying to get all the column values for a single row, and when the row is not sparse and has 75,000 values, the return time is extremely slow: it takes almost 2.5 seconds to fetch the data from the DB. I'm querying the table from a Lambda function running Happybase.
import time

import happybase

# connection is assumed to be an established happybase.Connection
start = time.time()
col = 'mycol'                     # used as the row key below
table = connection.table('mytable')
row = table.row(col)
end = time.time() - start
print("Time taken to fetch column from database:")
print(end)
What can I do to make this faster? This seems incredibly slow - the return payload is 75,000 value pairs, and is only ~2MB. It should be much faster than 2 seconds. I'm looking for millisecond return time.
I have a BLOCKCACHE size of 8194kb, a BLOOMFILTER of type ROW, and SNAPPY compression enabled on this table.
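A sketch of one thing worth measuring (assuming `connection` is the same Happybase connection; the 'cf' family name and 'val_*' qualifier pattern are placeholders): request only the qualifiers a given call actually needs via the columns argument of table.row(), so the server isn't shipping and the client isn't deserializing all 75,000 cells on every request.

# Sketch only: fetch a subset of qualifiers instead of the full 75,000-cell row.
# 'cf' and the 'val_*' qualifier names are placeholders; `connection` as in the question.
import time

import happybase

table = connection.table('mytable')
row_key = b'mycol'                                  # the row key used in the question

wanted = [b'cf:val_%d' % i for i in range(1000)]    # only the cells this request needs

start = time.time()
subset = table.row(row_key, columns=wanted)         # server-side column selection
print("Fetched %d cells in %.3f s" % (len(subset), time.time() - start))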

Simple SELECT/WHERE query slow. Should I index a BIT field? [duplicate]

This question already has answers here:
Should I index a bit field in SQL Server?
(18 answers)
Closed 8 years ago.
The following query takes 20-25 seconds to run
SELECT * from Books WHERE IsPaperback = 1
where IsPaperback is a BIT field. There are about 500k rows, and about 400k currently have this field set.
I also have a BIT field called IsBundle and only 900 records have this set. Again, execution time is about 20-25 seconds.
How can I speed up such a simple query?
Indexing a bit column will split it into two parts, true and false. If the data is split 50/50, the gain will be modest. When it is 90/10 and you query the 10% side, yes, it will make a difference.
You should first narrow down your result set column-wise. Then, if you see you just need a few columns and you execute this query a lot, you could even include those few fields in the index. Then there is no need for a lookup in the table itself.
First of all, I would explicitly call out the columns:
select
field1
, field2
, field3
from books
where IsPaperback = 1;
This seems to be a small thing, but when you use a star (*) for column selection, the DB has to look up the column names before actually performing the call.
Do you have an index on IsPaperback? That would impact the above query more than having an index on IsBundle.
If you had a condition of IsBundle = 1, then I would think there would be a need for an index on that field.
Add an Index for IsPaperback
Try making it an int, or tinyint. The latest processors actually process 32 bit words faster than bytes.
This query should take no more than a couple of milliseconds.
You should not have separate columns for IsPaperback and IsBundle. It should be a Type column where Paperback and Bundle are the values.
Before the query set profiling on
SET profiling = 1
After the query show profiles:
SHOW PROFILES
It seems there are some out there who do not believe this query should take only a few milliseconds.
For those that downvoted this answer without understanding it: what I said was true.
I found a table "cities" with 332,127 Records
In this table Russia has 929 cities
These benchmarks were performed on a GoDaddy Server IP 50.63.0.80
This is a GoDaddy Virtual Dedicated Server
On average I find sites hosted on GoDaddy to have the worst performance.
$time = microtime(true);
$results = mysql_query("SELECT * FROM `cities` WHERE `country` LIKE 'RS'");
echo "\n" . number_format(microtime(true)-$time,6)."\n";
$time = microtime(true);
while ($row = mysql_fetch_array($results, MYSQL_NUM)){$r[]=$row;}
echo "\n" . number_format(microtime(true)-$time,6);
Results:
With Index: 2.9 ms
0.002947 Seconds : $results = mysql_query("SELECT * FROM `cities` WHERE `country` LIKE 'RS'");
0.000081 Seconds : while ($row = mysql_fetch_array($results, MYSQL_NUM)){$r[]=$row;}
Without Index: 93 ms
0.093939 Seconds : $results = mysql_query("SELECT * FROM `cities` WHERE `country` LIKE 'RS'");
0.000073 Seconds : while ($row = mysql_fetch_array($results, MYSQL_NUM)){$r[]=$row;}
Then in phpMyAdmin Profiling:
SET PROFILING = ON;
SELECT * FROM `cities` WHERE `country` LIKE 'RS';
SHOW PROFILE;
Result:
Execution of the Query took 0.0000003 seconds
starting 0.000020
checking permissions 0.000004
Opening tables 0.000006
init 0.000007
optimizing 0.000003
executing 0.000003 ******
end 0.000004
query end 0.000003
closing tables 0.000003
freeing items 0.000010
logging slow query 0.000003
cleaning up 0.000003
Without Index
Execution of the Query took 0.0000012 seconds
starting 0.000046
checking permissions 0.000006
Opening tables 0.000010
init 0.000021
optimizing 0.000006
executing 0.000012 ******
end 0.000003
query end 0.000004
closing tables 0.000003
freeing items 0.000017
logging slow query 0.000004
cleaning up 0.000003
In phpMyAdmin, doing a search with profiling turned on:
GoDaddy server: Sending Data 92.6 ms
SELECT * FROM `cities` WHERE `country` LIKE 'RS' LIMIT 1000
Showing rows 0 - 928 (929 total, Query took 0.0907 sec)
Profiling Results:
Starting 52 µs
Checking Permissions 7 µs
Opening Tables 23 µs
System Lock 12 µs
Init 34 µs
optimizing 10 µs
Statistics 23 µs
Preparing 17 µs
Executing 4 µs
Sending Data 92.6 ms
End 18 µs
Query End 4 µs
Closing Tables 15 µs
Freeing Items 27 µs
Logging Slow Query 4 µs
Cleaning Up 5 µs
In phpMyAdmin, doing a search with profiling turned on:
On my server: Sending Data 1.8 ms
SELECT * FROM `cities` WHERE `country` LIKE 'RS' LIMIT 1000
Showing rows 0 - 928 (929 total, Query took 0.0022 sec)
Starting 27 µs
Checking Permissions 5 µs
Opening Tables 11 µs
System Lock 7 µs
Init 14 µs
Optimizing 5 µs
Statistics 43 µs
Preparing 6 µs
Executing 2 µs
Sending Data 1.8 ms
End 5 µs
Query End 3 µs
Closing Tables 5 µs
Freeing Items 13 µs
Logging Slow Query 2 µs
Cleaning Up 2 µs
Just to show the importance of an index: over 400x improvement.
A table with 5,480,942 Records and a Query that Returns 899 Rows
$time = microtime(true);
$results = mysql_query("SELECT * FROM `ipLocations` WHERE `id` = 33644");
echo "\n" . number_format(microtime(true)-$time,6);
$time = microtime(true);
while ($row = mysql_fetch_array($results, MYSQL_NUM)){$r[]=$row;}
echo "\n" . number_format(microtime(true)-$time,6);
No index
0.402005
0.001264
With Index (426x Faster)
0.001716
0.001962

Pandas groupby for k-fold cross-validation with aggregation

Say I have a DataFrame, df, with columns: id | site | time | clicks | impressions.
I want to use the machine learning technique of k-fold cross-validation (split the data randomly into k=10 equal-sized partitions, based on e.g. the id column). I think of this as a mapping from id to {0, 1, ..., 9} (so a new column 'fold' going from 0 to 9).
Then I iteratively take 9/10 of the partitions as training data and the remaining 1/10 partition as validation data
(so first fold == 0 is validation and the rest is training, then fold == 1 is validation and the rest is training, and so on).
[So I am thinking of this as a generator based on grouping by the fold column.]
Finally, I want to group all the training data by site and time (and similarly for the validation data), in other words sum over the fold index while keeping the site and time indices.
What is the right way of doing this in pandas?
The way I thought of doing it at the moment is
df_sum = df.groupby(['fold', 'site', 'time']).sum()
# so df_sum has indices fold, site, time
# create new Series object, dat, name='cross', by mapping fold indices
# to 'training'/'validation'
df_train_val = df_sum.groupby([dat, 'site', 'time']).sum()
df_train_val.xs('validation', level='cross')
Now the direct problem I run into is that groupby on columns will accept an extra Series object, but groupby on MultiIndex levels doesn't (the df_train_val assignment above doesn't work). Obviously I could use reset_index, but given that I want to group over site and time (to aggregate over folds 1 to 9, say) this seems wrong. (I assume grouping is much faster on indices than on 'raw' columns.)
So Question 1: is this the right way to do cross-validation followed by aggregation in pandas, and more generally to group and then regroup based on MultiIndex values?
Question 2: is there a way of mixing arbitrary mappings with multilevel indices?
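For the fold assignment itself (the id to {0, 1, ..., 9} mapping described above), a minimal sketch, assuming `df` and the `id` column from the question and an arbitrary random seed:

# Sketch: one way to build the id -> fold mapping (k = 10) described above.
import numpy as np
import pandas as pd

k = 10
rng = np.random.RandomState(0)                          # arbitrary seed, for reproducibility

unique_ids = df['id'].unique()
fold_of_id = pd.Series(rng.randint(0, k, size=len(unique_ids)),
                       index=unique_ids)

df['fold'] = df['id'].map(fold_of_id)                   # new 'fold' column, values 0..9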
This generator seems to do what I want. You pass in the grouped data (with one index level corresponding to the fold, 0 to n_folds-1).
def split_fold2(fold_data, n_folds, new_fold_col='fold'):
    i_fold = 0
    indices = list(fold_data.index.names)
    slicers = [slice(None)] * len(fold_data.index.names)
    fold_index = fold_data.index.names.index(new_fold_col)
    indices.remove(new_fold_col)
    while i_fold < n_folds:
        # select all folds except the held-out one, then aggregate over the fold level
        slicers[fold_index] = [i for i in range(n_folds) if i != i_fold]
        slicers_tuple = tuple(slicers)
        train_data = fold_data.loc[slicers_tuple, :].groupby(level=indices).sum()
        # the held-out fold becomes the validation set
        val_data = fold_data.xs(i_fold, level=new_fold_col)
        yield train_data, val_data
        i_fold += 1
On my data set this takes :
CPU times: user 812 ms, sys: 180 ms, total: 992 ms
Wall time: 991 ms
(to retrieve one fold)
replacing train_data assignment with
train_data=fold_data.select(lambda x: x[fold_index]!=i_fold).groupby(level=indices).sum()
takes
CPU times: user 2.59 s, sys: 263 ms, total: 2.85 s
Wall time: 2.83 s
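For completeness, a hedged sketch of driving split_fold2 end to end, assuming `df` already carries the 'fold' column and the column names from the question:

# Sketch: feed split_fold2 the grouped frame described in the question.
n_folds = 10

df_sum = df.groupby(['fold', 'site', 'time']).sum()   # MultiIndex: fold, site, time

for train_data, val_data in split_fold2(df_sum, n_folds):
    # train_data: summed over the 9 training folds, indexed by (site, time)
    # val_data:   the single held-out fold, indexed by (site, time)
    print(train_data.shape, val_data.shape)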