Not able to view records that are available in a collection using pymongo with the aadhar dataset - pymongo

I'm trying to process my aadhar dataset using pymongo. My queries work fine against normal sample documents, but when I process my own dataset the records are not visible, even though my collection contains the 400k records from the aadhar dataset. I am able to count the number of records in my collection, but I'm not able to view them.
import pandas
from pymongo import MongoClient

client = MongoClient()  # local MongoDB instance
df = pandas.read_csv('/home/hadoop/Downloads/aadhar.csv')
database = client.database
collect = database.collect
# inserts one document per integer, i.e. {'df': 0}, {'df': 1}, ...
result = database.collect.insert_many([{'df': i} for i in range(418068)])
database.collect.count()
I'm getting the count as 418068, which matches the number of records in the aadhar dataset.
When I use the snippet below to view a single document, it doesn't work:
import pprint
pprint.pprint(collect.find_one())
I'm not able to view the records that are available in the collection.
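For comparison, a minimal sketch of inserting the CSV rows themselves as documents and reading a few back, assuming a local MongoDB and the same file path; note that the list comprehension above stores only the integers 0-418067, not the dataframe contents:
import pandas
from pymongo import MongoClient

client = MongoClient()  # assumes a local MongoDB instance
db = client.database
df = pandas.read_csv('/home/hadoop/Downloads/aadhar.csv')

db.collect.insert_many(df.to_dict('records'))   # one document per CSV row

print(db.collect.count_documents({}))           # total number of documents
for doc in db.collect.find().limit(5):          # view a few documents
    print(doc)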

Related

How to take a sample from a dask dataframe containing all the products ordered by a certain number of customers only?

I tried loading my csv file using pd.read_csv. It has 33 million records and takes too much time to load and query.
I have data for 200k customers.
The data loads quickly with a dask dataframe, but queries take much longer.
This is the code I have written for sampling:
df_s = df.sample(frac=300000/33819106, replace=False, random_state=10)
This works fine, but each customer has ordered many products. How do I include all of a customer's products in the sample, i.e. how do I sample based on customer id?
Load your data into a dataframe and then sample from it. Output to a new .csv that is easier to read from.
import pandas as pd

df = pd.read_csv('customers.csv')
df = df.sample(frac=.2)  # 20% of the rows will be sampled.
df.to_csv('sample_customers.csv')  # Create an easier to work with .csv
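To keep every product of a sampled customer together, as the question asks, one option is to sample unique customer ids instead of rows; a minimal pandas sketch, assuming the id column is named customer_id:
import pandas as pd

df = pd.read_csv('customers.csv')  # one row per product ordered

# Sample customers, not rows, so every product of a sampled customer is kept.
ids = df['customer_id'].drop_duplicates().sample(frac=0.2, random_state=10)
sample_df = df[df['customer_id'].isin(ids)]
sample_df.to_csv('sample_customers.csv', index=False)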
Generally the format of a question on here is
Description of problem
Desired outcome
What you've tried
Minimum reproducible example

Push dataset as an alternative to DirectQuery

I have created a Power BI report using DirectQuery to get live reports. But when I change a slicer filter in a visual, it takes a long time to load data into the visual table. So I searched for an alternative to DirectQuery and found push datasets. On analysis, it seems a push dataset uses an API for streaming. Is it possible to use a push dataset for the below select query (15 columns and 20k rows) as simply as DirectQuery?
EX: select * from persons p
left join students s on s.id=p.id
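For reference, a rough sketch of pushing the result of such a query into an existing push dataset through the REST "add rows" endpoint, assuming pyodbc for the SQL side; the connection string, dataset id, table name and access token are placeholders:
import pyodbc
import requests

conn = pyodbc.connect("DSN=mydb")  # placeholder connection string
cursor = conn.cursor()
cursor.execute("select * from persons p left join students s on s.id = p.id")
columns = [c[0] for c in cursor.description]
rows = [dict(zip(columns, r)) for r in cursor.fetchall()]

url = ("https://api.powerbi.com/v1.0/myorg/datasets/"
       "<dataset_id>/tables/<table_name>/rows")
headers = {"Authorization": "Bearer <access_token>"}

# The API caps how many rows a single POST may carry, so push in batches.
batch_size = 5000
for i in range(0, len(rows), batch_size):
    resp = requests.post(url, headers=headers,
                         json={"rows": rows[i:i + batch_size]})
    resp.raise_for_status()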

Keep queried SQL data in one state while updating

The use case is as follows:
I scrape data from a bigger database (read-only access) on a fixed schedule, and it takes roughly 30 minutes to 1 hour
the resulting table will always have >20k rows; the data can be grouped by a dataset_id column that is constrained to 4 values, like an enum
when I query all the rows of one dataset_id, say SELECT * FROM db WHERE dataset_id = A, it is essential that all the records come from the same scrape (I shouldn't get mixed data from different scrapes)
The question is: how do I keep serving the old data until the new scrape is finished, and only then switch to the new data while deleting the old scrape?
I have thought of the following option:
have 2 tables and switch between them when a newer scrape is finished
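A minimal sketch of that two-table swap, assuming Postgres and psycopg2; the table names and connection string are illustrative, not from the question:
import psycopg2

conn = psycopg2.connect("dbname=scrapes")

# 1. Build the new scrape in a staging table; readers keep hitting the
#    current table ("scrape_live") untouched for the whole 30-60 minutes.
with conn, conn.cursor() as cur:
    cur.execute("DROP TABLE IF EXISTS scrape_staging")
    cur.execute("CREATE TABLE scrape_staging (LIKE scrape_live INCLUDING ALL)")
    # ... long-running INSERTs into scrape_staging go here ...

# 2. Swap atomically in one transaction, so SELECT * FROM scrape_live
#    WHERE dataset_id = 'A' always sees rows from a single scrape.
with conn, conn.cursor() as cur:
    cur.execute("ALTER TABLE scrape_live RENAME TO scrape_old")
    cur.execute("ALTER TABLE scrape_staging RENAME TO scrape_live")
    cur.execute("DROP TABLE scrape_old")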

Running regressions iteratively for subsets of pySpark dataframes - partitioning by DF columns or mapPartitions?

I have sales data by store and product_category for every week in the following format.
STORE|PRODUCT_CAT|WK_ENDING|<PREDICTOR_VARIABLES>|TARGET_VARIABLE
S1|P1|2016-01-01|..|....
S1|P1|2016-01-08|..|....
S1|P1|2016-01-15|..|....
S1|P2|2016-01-01|..|....
S1|P2|2016-01-08|..|....
S1|P2|2016-01-15|..|....
S2|P1|2016-01-01|..|....
S2|P1|2016-01-08|..|....
S2|P1|2016-01-15|..|....
S2|P2|2016-01-01|..|....
S2|P2|2016-01-08|..|....
S2|P2|2016-01-15|..|....
...
...
As you can see, it has multiple weekly records for every store-product combination.
There could be about 200 different stores and ~50 different product categories, i.e. ~200 x ~50 = ~10,000 different store-product combinations. For every such combination we will have data for about 4-5 years, i.e. roughly 250 records.
The requirement is that we run a separate regression model for each of the store-product combinations. That means we need to run thousands of regressions, but on very small datasets. What is the way to go about this?
Options tried / thought about -
1. Usual "FOR" loops -
Extracted the unique store-category combinations and then, for each store and for each category (nested for loop), filtered the data from the above DF and ran the models.
The process runs for about 10-12 stores and then throws memory errors. Note that the above DF is persisted.
I have seen for other similar computations that pySpark does not handle for loops well when they reference the same DF from inside the loop.
Following is the code snippet -
main_df.persist()  # This is the master dataframe, containing all the above data, that is persisted

from pyspark.ml.feature import VectorAssembler
from pyspark.ml.regression import LinearRegression

for store in store_lst:
    # <some calculations like filtering the master dataframe by store etc.>
    main_df_by_store = main_df.filter(main_df['store_id'] == str(store))
    for cat in cat_lst:
        assembler = VectorAssembler(inputCols=['peer_pos_sales'], outputCol='features')
        traindata = main_df_by_store.filter(main_df_by_store['rbt_category'] == str(cat))
        output = assembler.transform(traindata)
        modelfit = output.drop('peer_pos_sales').withColumnRenamed('vacant_pos_sales', 'label')
        lr = LinearRegression(maxIter=10, regParam=0.3, elasticNetParam=0.8)
        lrModel = lr.fit(modelfit)
        result = lrModel.transform(modelfit)
2. Window function with a UDF -
Can we create a window function, partitioned by store and category, and then apply a UDF to run the regressions?
However, it appears that we can only use built-in functions with window functions, not UDFs? Is that correct?
How do I handle this? Looping is killing the server, and this needs to be done in pySpark only.
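One approach that avoids both the driver-side loop and the window-function limitation is a grouped pandas UDF; a sketch assuming Spark 3.x (applyInPandas), scikit-learn available on the workers, and the column names from the snippet above:
import pandas as pd
from sklearn.linear_model import LinearRegression
from pyspark.sql.types import StructType, StructField, StringType, DoubleType

result_schema = StructType([
    StructField('store_id', StringType()),
    StructField('rbt_category', StringType()),
    StructField('intercept', DoubleType()),
    StructField('coef', DoubleType()),
])

def fit_group(pdf: pd.DataFrame) -> pd.DataFrame:
    # pdf holds every week for one store / category combination (~250 rows)
    model = LinearRegression().fit(pdf[['peer_pos_sales']], pdf['vacant_pos_sales'])
    return pd.DataFrame([{
        'store_id': pdf['store_id'].iloc[0],
        'rbt_category': pdf['rbt_category'].iloc[0],
        'intercept': float(model.intercept_),
        'coef': float(model.coef_[0]),
    }])

# Each store/category group is handed to fit_group as a small pandas frame,
# so thousands of tiny regressions run in parallel across the executors.
results = main_df.groupBy('store_id', 'rbt_category').applyInPandas(fit_group, schema=result_schema)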

Django: access and update a variable on which to filter during a Django model query?

I'm hoping to build a Django query against my model that lets the filter change as the query progresses.
I have a model Activity that I'm querying for. Each object has a postal_code field and I'm querying for multiple zip codes stored in an array postal_codes_to_query across a date range. I'd like to ensure that I get an even spread of objects across each of the zip codes. My database has millions of Activities, so when I query with a limit, I only receive activities that match zip codes early on in postal_codes_to_query. My current query is below:
Activity.objects.filter(postal_code__in=postal_codes_to_query).filter(start_time_local__gte=startTime).filter(start_time_local__lte=endTime).order_by('start_time_local')[:10000]
If I'm searching for, say, 20 zip codes, ideally I'd like to receive 10,000 activities, with 500 activities for each zip code that I queried on.
Is this possible in Django? If not, is there some custom SQL I could write to achieve this? I'm using a Heroku Postgres database in case that matters.
You can't do this in a single query, either in Django or (as far as I know) in plain SQL.
The best bet is simply to iterate through the list of zip codes, querying for at most 500 in each one:
activities_by_zip = {}
for code in postal_codes_to_query:
    activities = Activity.objects.filter(postal_code=code).filter(
        start_time_local__gte=startTime).filter(
        start_time_local__lte=endTime).order_by('start_time_local')[:500]
    activities_by_zip[code] = activities
Of course, this is one query per zip, but I think that's the best you're going to do.
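If a single round trip ever matters, a raw window-function query on Postgres is one hedged alternative; a sketch, assuming the table follows Django's default naming (e.g. myapp_activity):
# Rank activities inside each zip code and keep at most 500 per zip.
per_zip_sql = """
    SELECT * FROM (
        SELECT a.*,
               ROW_NUMBER() OVER (
                   PARTITION BY a.postal_code
                   ORDER BY a.start_time_local
               ) AS rn
        FROM myapp_activity a
        WHERE a.postal_code = ANY(%s)
          AND a.start_time_local BETWEEN %s AND %s
    ) ranked
    WHERE ranked.rn <= 500
"""
# psycopg2 adapts the Python list to a Postgres array for ANY(%s).
activities = Activity.objects.raw(per_zip_sql, [postal_codes_to_query, startTime, endTime])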