How to take a sample from a dask dataframe that keeps all the products ordered by the sampled customers? - pandas

I tried loading my csv file using pd.read_csv. It has 33 million records, so loading and querying both take too much time. The data covers about 200k customers. Loading is quick when I use a dask dataframe instead, but queries still take a long time.
This is the code I have written for sampling:
df_s = df.sample(frac=300000/33819106, replace=None, random_state=10)
This works fine, but each customer has ordered many products. How do I make the sample include all the products of the sampled customers, i.e. how do I sample based on customer id?

Load your data into a dataframe and then sample from it. Output to a new .csv that is easier to read from.
df = pd.read_csv('customers.csv')
df = df.sample(frac=.2) # 20% of the rows will be sampled.
df.to_csv('sample_customers.csv') # Create an easier to work with .csv
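If the goal is to keep every order of the sampled customers, one option is to sample customer ids first and then keep all rows belonging to those ids. A minimal sketch with dask, assuming the file has a customer_id column (the column name and the sampling fraction are placeholders):
import dask.dataframe as dd

df = dd.read_csv('customers.csv')                      # lazy load of the large file
# The ~200k distinct ids fit comfortably in memory, so collect and sample them,
# then keep every row belonging to the sampled customers.
ids = df['customer_id'].drop_duplicates().compute()
sampled_ids = ids.sample(frac=0.1, random_state=10)    # placeholder fraction
df_s = df[df['customer_id'].isin(sampled_ids.tolist())].compute()
df_s.to_csv('sample_customers.csv', index=False)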
Generally the format of a question on here is
Description of problem
Desired outcome
What you've tried
Minimum reproducible example

Related

calculate count by pandas and display/filter in PowerBI

I have a dataset about user action logging on a web page. Because the whole dataset is huge, I want to use pandas to calculate the counts first, and then visualize the count data in PowerBI. The important thing is that I can filter the data
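A minimal sketch of that pre-aggregation step, assuming hypothetical user and action columns in a hypothetical user_actions.csv; the small aggregated file can then be imported into PowerBI and filtered there:
import pandas as pd

logs = pd.read_csv('user_actions.csv')                            # raw action log
counts = logs.groupby(['user', 'action']).size().reset_index(name='count')
counts.to_csv('action_counts.csv', index=False)                   # small file for PowerBI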

paginate a response with two data sources

I have two sources of data. I fetch data from both sources, combine the rows on the basis of date, and send them as one response.
Now I am trying to send the response as a paginated response, but I don't know how to do this because there is no way to know how many rows should be taken from source 1 and how many from source 2.
For example, consider two stores of data: comments and likes.
One stores comments (this is a MySQL table):
comment
Date
The second stores likes (this I compute from some other data source):
Like
Date
Now I want to send the combined comments and likes result as paginated.
Suppose I ask for a response with offset = 0 and limit = 20. I don't know how many rows I should take from the comments table and how many from the likes data source.
For the first page I could merge the data and slice the first 20 rows, but that would mean merging both data sources on every request and then slicing, which is not possible due to constraints.
Limitations: the two sources cannot live in the same table or DB, and I cannot merge the complete data on every request and slice the result.
Please help me figure out this problem. I have tried to explain everything as fully as I can; please ask if anything else is needed.
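One common compromise is to over-fetch: pull only the first offset + limit rows (ordered by date) from each source, merge those two small, already-sorted lists, and slice out the requested window, so neither source is ever read in full. A minimal sketch, where fetch_comments and fetch_likes are hypothetical helpers that each return rows sorted by date:
import heapq

def paginate(offset, limit, fetch_comments, fetch_likes):
    # Each source only needs to supply its first offset + limit rows.
    n = offset + limit
    comments = fetch_comments(limit=n)   # e.g. SELECT ... ORDER BY date LIMIT n
    likes = fetch_likes(limit=n)
    # Merge the two sorted lists by date, then cut out the requested page.
    merged = list(heapq.merge(comments, likes, key=lambda row: row['date']))
    return merged[offset:offset + limit]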

Moving files from one parquet partition to another

I have a very large amount of data in my S3 bucket, partitioned by two columns, MODULE and DATE,
such that the file structure of my parquet dataset is:
s3://my_bucket/path/file.parquet/MODULE='XYZ'/DATE=2020-01-01
I have 7 MODULE values and the DATE ranges from 2020-01-01 to 2020-09-01.
I found a discrepancy in the data and need to correct the MODULE entries for one of the modules. Basically I need to change all data for a particular index number, belonging to MODULE XYZ, to MODULE ABC.
I can do this in pyspark by loading the data frame and doing something like:
df=df.withColumn('MODULE', when(col('index')==34, "ABC").otherwise(col('MODULE')))
But how do I repartition it so that only those entries that are changed get moved to the ABC MODULE partition? If I do something like:
df.write.mode('append').partitionBy('MODULE','DATE').parquet("s3://my_bucket/path/file.parquet")
I would be adding the data along with the erroneous MODULE data. Plus, I have almost a year's worth of data and don't want to repartition the entire dataset, as it would take a very long time.
Is there a way to do this?
If I understand correctly, you have data in partition MODULE=XYZ that should be moved to MODULE=ABC.
First, identify the impacted files.
from pyspark.sql import functions as F

file_list = [
    row[0]
    for row in df.where(F.col("index") == 34)
    .select(F.input_file_name())
    .distinct()
    .collect()
]
Then, create a dataframe based only on these files and use it to rewrite both MODULE partitions.
df = spark.read.parquet(*file_list).withColumn(
    "MODULE", F.when(F.col("index") == 34, "ABC").otherwise(F.col("MODULE"))
)
df.write.parquet(
    "s3://my_bucket/path/file.parquet",  # dataset root, so MODULE=.../DATE=... partitions are created
    mode="append",
    partitionBy=["MODULE", "DATE"],
)
At this point ABC should be OK (you just added the missing data), but XYZ is now wrong because of the duplicated data. To recover XYZ, you just need to delete the files listed in file_list.
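A minimal sketch of that final deletion step using boto3, assuming the paths in file_list are plain s3:// (or s3a://) URIs and nothing else is writing to them concurrently:
import boto3
from urllib.parse import urlparse

s3 = boto3.client("s3")
for path in file_list:
    parsed = urlparse(path)                 # s3://bucket/key -> netloc=bucket, path=/key
    s3.delete_object(Bucket=parsed.netloc, Key=parsed.path.lstrip("/"))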
IIUC you can do this by filtering the data for that particular index and then saving that data with DATE as the partition.
df = df.withColumn('MODULE', when(col('index') == 34, "ABC").otherwise(col('MODULE')))
df = df.filter(col('index') == 34)
df.write.mode('overwrite').partitionBy('DATE').parquet("s3://my_bucket/path/ABC/")
This way you end up modifying only the changed module, i.e. ABC.

Running regressions iteratively for subsets of pySpark dataframes - partitioning by DF columns or mapPartitions?

I have sales data by store and product_category for every week in the following format.
STORE|PRODUCT_CAT|WK_ENDING|<PREDICTOR_VARIABLES>|TARGET_VARIABLE
S1|P1|2016-01-01|..|....
S1|P1|2016-01-08|..|....
S1|P1|2016-01-15|..|....
S1|P2|2016-01-01|..|....
S1|P2|2016-01-08|..|....
S1|P2|2016-01-15|..|....
S2|P1|2016-01-01|..|....
S2|P1|2016-01-08|..|....
S2|P1|2016-01-15|..|....
S2|P2|2016-01-01|..|....
S2|P2|2016-01-08|..|....
S2|P2|2016-01-15|..|....
...
...
As you can see it has multiple records by week for every Store - Product combination.
There could be about 200 different stores and ~50 different product categories, i.e. ~200 x ~50 = ~10,000 different store-product combinations (say). For every such combination we will have about 4-5 years of data, i.e. roughly 250 records.
The requirement is that we run a separate regression model for each store-product combination. That means we need to run thousands of regressions, but each on a very small dataset. What is the way to go about this?
Options tried / thought about -
1. Usual "FOR" loops -
Extracted the unique store-category combinations and then, for each store and each category (nested for loop), filtered the data from the above DF and ran the models.
The process runs for about 10-12 stores and then throws memory errors. Note that the above DF is persisted.
I have seen in other, similar computations that pySpark does not handle for loops well when the loop has to reference the same DF from inside it.
Following is the code snippet -
main_df.persist()  # This is the master dataframe, containing all the above data; it is persisted

from pyspark.ml.feature import VectorAssembler
from pyspark.ml.regression import LinearRegression

for store in store_lst:
    # <some calculations like filtering the master dataframe by store etc.>
    main_df_by_store = main_df.filter(main_df['store_id'] == str(store))
    for cat in cat_lst:
        assembler = VectorAssembler(inputCols=['peer_pos_sales'], outputCol='features')
        traindata = main_df_by_store.filter(main_df_by_store['rbt_category'] == str(cat))
        output = assembler.transform(traindata)
        modelfit = output.drop('peer_pos_sales').withColumnRenamed('vacant_pos_sales', 'label')
        lr = LinearRegression(maxIter=10, regParam=0.3, elasticNetParam=0.8)
        lrModel = lr.fit(modelfit)
        result = lrModel.transform(modelfit)
Can we create a window function, partitioned by store and category, and then apply a UDF to run the regressions?
However, it appears that we can only use built-in functions with window functions, not UDFs. Is that correct?
How do I handle this? The looping approach is killing the server.
This needs to be done in pySpark only.
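One way to express "one small model per store-category group" without a driver-side loop is a grouped pandas UDF, where Spark hands each group to a worker as a pandas DataFrame. A minimal sketch, assuming Spark 3.x (for applyInPandas) and the column names from the snippet above; it fits a plain least-squares line rather than the elastic-net model, but the grouping pattern is the same:
import numpy as np
import pandas as pd
from pyspark.sql.types import StructType, StructField, StringType, DoubleType

result_schema = StructType([
    StructField("store_id", StringType()),
    StructField("rbt_category", StringType()),
    StructField("intercept", DoubleType()),
    StructField("slope", DoubleType()),
])

def fit_group(pdf: pd.DataFrame) -> pd.DataFrame:
    # Ordinary least squares on one store-category group; runs on a worker.
    x = pdf["peer_pos_sales"].to_numpy()
    y = pdf["vacant_pos_sales"].to_numpy()
    slope, intercept = np.polyfit(x, y, 1)
    return pd.DataFrame({
        "store_id": [pdf["store_id"].iloc[0]],
        "rbt_category": [pdf["rbt_category"].iloc[0]],
        "intercept": [float(intercept)],
        "slope": [float(slope)],
    })

coeffs = main_df.groupBy("store_id", "rbt_category").applyInPandas(fit_group, schema=result_schema)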

How do I write partitioned data to a file with the partition value in the filename?

I unloaded a bunch of data from an RDBMS by month and loaded it into Google Cloud Storage (GCS) based on that month. I then read the entire data set into a pyspark data frame on a Dataproc cluster and would like to re-write it to GCS based on the day rather than the month. I've successfully written to cloud storage so that each file only contains a certain date, but have not been able to efficiently name the file or directory based on that date. The code below does what I want it to do, but it is VERY inefficient. I also know I could theoretically get around this by using parquet files instead, but my requirement is to write CSV. Ultimately I want to load this data into BigQuery with a table per day; if there is an easier solution there, I could then just export each per-day table to a file.
# Find the distinct dates, then write the rows for each date separately.
dates = sqlContext.sql("SELECT distinct THE_DATE FROM tbl")
x = dates.collect()
for d in x:
    date = d.THE_DATE
    single_wk = sqlContext.sql("SELECT * FROM tbl WHERE THE_DATE = '{}'".format(date))
    towrite = single_wk.map(to_csv)
    towrite.coalesce(4).saveAsTextFile('gs://buck_1/AUDIT/{}'.format(date))
So say the data I read in has the dates ['2014-01-01', '2014-01-02', '2014-01-03']; I would want the resulting files / directories to look like this:
gs://buck_1/AUDIT/2014-01-01/part-1
gs://buck_1/AUDIT/2014-01-01/part-2
gs://buck_1/AUDIT/2014-01-01/part-3
gs://buck_1/AUDIT/2014-01-01/part-4
gs://buck_1/AUDIT/2014-01-02/part-1
gs://buck_1/AUDIT/2014-01-02/part-2
gs://buck_1/AUDIT/2014-01-02/part-3
gs://buck_1/AUDIT/2014-01-02/part-4
gs://buck_1/AUDIT/2014-01-03/part-1
gs://buck_1/AUDIT/2014-01-03/part-2
gs://buck_1/AUDIT/2014-01-03/part-3
gs://buck_1/AUDIT/2014-01-03/part-4
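If the cluster's Spark version has the DataFrame CSV writer (Spark 2.0+), partitionBy produces this directory-per-date layout in a single job instead of one query per date; a minimal sketch (note the sub-directories come out as THE_DATE=2014-01-01 rather than bare dates):
df = sqlContext.table("tbl")
(df.write
   .partitionBy("THE_DATE")            # one sub-directory per distinct date
   .option("header", "true")
   .csv("gs://buck_1/AUDIT/"))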