I have millions of devices, and from each of these devices I get counter values at periodic intervals.
These values have to be held in memory, and I have to do further processing on top of them.
This is my problem. To handle it, I am using a pandas DataFrame with 2 columns: one column is the timestamp and the other is the actual counter value.
There are multiple options that I am looking at.
1.) One DataFrame for all these 1 million devices.
2.) The 1 million devices are logically grouped by the location of the device. So instead of having all 1 million device counters in a single DataFrame,
I can create multiple DataFrames based on the location of the server where each device resides (sketched below).
Done the second way, it comes to close to 1,000 DataFrames, one DataFrame per location group.
The reason for splitting into 1,000 DataFrames is the parallelism (both read and write) I can get with multiprocessing.
I can implement parallelism effectively, and calculations on a smaller DataFrame should also take less time.
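To make the two layouts concrete, here is a minimal sketch of both options; the column names and location labels are hypothetical:

import pandas as pd

# Option 1: a single DataFrame holding every device's counters
all_counters = pd.DataFrame(columns=["device_id", "location", "timestamp", "counter"])

# Option 2: one DataFrame per location group, kept in a dict keyed by location
locations = ["loc_0001", "loc_0002"]          # hypothetical labels, ~1,000 in total
per_location = {loc: pd.DataFrame(columns=["device_id", "timestamp", "counter"])
                for loc in locations}

With the second layout, each worker process can own a subset of the dict's keys, so reads and writes for different locations never touch the same DataFrame.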
Some of the questions I have are:
1.) What is the maximum number of DataFrames that can be created in pandas on a system with 1-2 GB of RAM?
2.) Does having a larger number of DataFrames induce any overhead in pandas?
I need some opinions on the best way to go - should I go with 1,000 DataFrames or 1 huge DataFrame?
Please note: I receive the data from the devices over the network at a periodic interval (e.g. every 5 minutes), and the total number of devices is approximately 1,000,000.
Also, depending on some conditions, I regularly need to delete old data. It's not just reading and appending data to the DataFrame but also deleting old data.
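For the append-plus-delete cycle, one simple pattern (a sketch that assumes a datetime timestamp column and a retention window, rather than your exact deletion conditions) is:

import pandas as pd

def append_and_trim(df, new_rows, max_age="1D"):
    # append the latest samples, then drop rows older than the retention window
    df = pd.concat([df, new_rows], ignore_index=True)
    cutoff = df["timestamp"].max() - pd.Timedelta(max_age)
    return df[df["timestamp"] >= cutoff]

With 1,000 per-location DataFrames, the same function can be applied to each frame independently in its own process.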
Summary: I am facing performance issues merging pandas dataframes. I am looking for a technique to merge thousands of columns into a single dataframe. I have tried
.merge()
pre-creating cols with reindex()
dask
Background
I have a Postgres database with hundreds of millions of rows that represent 1.5 million features of a patient... I'm trying to boil it down to something more manageable, about 100,000 features.
I have a patient dataframe with 6000 rows representing patients: unique ID, and a few features. I have a method to identify the most important features, and now I want to merge those into my dataframe on the patient ID key. This works fine with only a few thousand features (3000), but as I scale, it starts to crawl.
Pandas merge is single-threaded, and is elegant but too slow for this many features. It works but takes a few days -- I think it re-allocates memory with each merge.
I tried pre-creating the columns with null values to ensure all are sorted correctly, and then updating each column, but the update is still very slow. It has been running for 5 hours now, using only 1 core. I'm not sure how long it will take. My logs say it has only finished 4,000/100,000 so far. That doesn't seem right, but I forgot to flush the print, so I'll have to be patient.
I tried dask but it doesn't appear to perform a multithreaded merge - it sticks to one CPU at the point where I merge them. (The dask dataframe was created from a pandas dataframe in case that matters).
Postgres has a 1,600-column limit, so I can't transpose it into a table.
I'm not limited to Python/pandas... but would like to stick to my current stack (Postgres, Python, and R) -- can Python scale to a dataframe of that size (6,000 x 100,000)?
Any thoughts?
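One approach that usually sidesteps the per-merge reallocation is to give every feature a Series indexed by the patient ID and concatenate them all in a single pass instead of merging one at a time. A rough sketch, where patients and feature_series are hypothetical stand-ins for your existing data:

import pandas as pd

# patients: 6,000-row DataFrame indexed by patient ID
# feature_series: dict of {feature_name: Series indexed by patient ID},
# produced by whatever query extracts each feature from Postgres

columns = [s.rename(name) for name, s in feature_series.items()]

# one concat aligns all columns on the shared index in a single step
features = pd.concat(columns, axis=1)
result = patients.join(features)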
I have a performance issue I'd like your input on.
This is based entirely on Databricks on Azure, with storage on Azure Data Lake Storage. The tech stack is not more than 2 years old and is all on the most recent release.
Say I have a datamart Delta table with 100 columns and 30,000,000 rows, growing by 225,000 rows every calendar quarter.
There isn't a data warehouse in this architecture, so the newest 225,000 rows are simply appended to the datamart: 30,000,000+ rows and growing every quarter.
Two columns are a dimension key Dim1_cd and a matching Dim1_desc.
There are 36 other dimension key-value pairs in the datamart, much like Dim1 is a key-value pair.
The datamart is a list of transactions and has a Period column, e.g. "2021Q3"; Period is the first-level and only partition column of the datamart.
The partitioning currently divides the Delta table into 15 partition folders, each holding around 150 parquet files of roughly 100 MB each.
A calendar quarter later, a new set of files is delivered to be appended to the datamart. One of them is a Dim1_lookup.txt file, which is first read into a Dim1_deltaTable and has only two columns, Dim1_cd and Dim1_desc. Each row is distinct (third normal form). On disk, the Dim1_lookup.txt file is only 55 KB.
Applying this dimension's newest version sometimes takes only 3-4 minutes, when there are no Dim1_desc values needing to be updated. Other times, there are 20,000 to 100,000 updates to be written across hundreds to thousands of parquet files, and it can take an unpleasantly long time.
Of course, writing the code for a Delta table update to apply the Dim1_deltaTable is no big challenge.
But what can you suggest to optimize the updates?
Ideally you might have a data warehouse backing the datamart, but that is not the case in this architecture.
You might want to partition on Dim1_desc to take advantage of Delta's data skipping, but there are 36 other _desc fields with the same update concerns.
What do you consider possible to optimize the update and minimize update processing time?
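For reference, a minimal sketch of applying the Dim1 lookup with a Delta MERGE that only matches rows whose description actually changed; the table name, file path, and delimiter are assumptions:

from delta.tables import DeltaTable

dim1_updates = (spark.read
                     .option("header", True)
                     .option("sep", "|")                   # delimiter is an assumption
                     .csv("/mnt/incoming/Dim1_lookup.txt"))

datamart = DeltaTable.forName(spark, "datamart")           # table name is an assumption

(datamart.alias("t")
         .merge(dim1_updates.alias("u"),
                "t.Dim1_cd = u.Dim1_cd AND t.Dim1_desc <> u.Dim1_desc")
         .whenMatchedUpdate(set={"Dim1_desc": "u.Dim1_desc"})
         .execute())

Including the t.Dim1_desc <> u.Dim1_desc clause in the match condition means rows that already carry the right description do not match, so their files are not rewritten, which helps in the quarters where few or no descriptions change.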
I have a dataframe with three columns: ID, x_coordinate, y_coordinate. Each ID appears as many times as it has coordinates, and that count varies from ID to ID. I am planning to do bin averaging and outlier removal for each ID. At the end, I would like to have a dataframe with one row per ID and the final avg_x_coordinate and avg_y_coordinate.
What data structure and transformations would you recommend? Ideally without loops. (I am planning to do the binning with 2D histograms.)
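As a loop-free baseline, groupby/transform can do both the outlier removal and the averaging in a few vectorized steps. This sketch uses a simple per-ID z-score filter as a stand-in for the 2D-histogram binning you have in mind:

import pandas as pd

def average_coordinates(df, z_max=3.0):
    g = df.groupby("ID")
    # per-ID z-scores for each coordinate
    zx = (df["x_coordinate"] - g["x_coordinate"].transform("mean")) / g["x_coordinate"].transform("std")
    zy = (df["y_coordinate"] - g["y_coordinate"].transform("mean")) / g["y_coordinate"].transform("std")
    kept = df[(zx.abs() <= z_max) & (zy.abs() <= z_max)]
    # one row per ID with the averaged coordinates
    return (kept.groupby("ID", as_index=False)[["x_coordinate", "y_coordinate"]]
                .mean()
                .rename(columns={"x_coordinate": "avg_x_coordinate",
                                 "y_coordinate": "avg_y_coordinate"}))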
I have a 500 MB CSV file which I am reading as a dataframe.
I am looking for an optimal number of partitions for this dataframe.
I need to do some wide transformations and join this dataframe with another CSV, so I now have the 3 approaches below for repartitioning this dataframe:
df.repartition(no. of cores)
Repartitioning the dataframe as per the calculation 500 MB / 128 MB ≈ 4 partitions, so as to have at least 128 MB of data in each partition
Repartitioning the dataframe using specific columns of the CSV so as to co-locate data in the same partitions
I want to know which of these options will be best for parallel computation and processing in Spark 2.4.
If you know the data very well, then using the columns to partition the data works best. However, repartitioning based on either the block size or the number of cores is subject to change whenever the cluster configuration changes, and you would need to adjust those values for every environment if the cluster configuration is different in higher environments. So, all in all, going with data-driven repartitioning is the better approach.
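For concreteness, the three options look roughly like this; the file path, join column, and the existing SparkSession name are hypothetical:

df = spark.read.option("header", True).csv("/data/transactions.csv")

# 1) repartition to the number of cores available to the job
df_by_cores = df.repartition(spark.sparkContext.defaultParallelism)

# 2) repartition by target size: 500 MB / 128 MB ~= 4 partitions
df_by_size = df.repartition(4)

# 3) repartition on the join key so matching rows are co-located
df_by_key = df.repartition("customer_id")    # column name is hypothetical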
I have two datasets: one is around 45 GB and contains daily transactions for 1 year, and the second one is 3.6 GB and contains customer IDs and details. I want to merge the two on a common column to create a single dataset, which exceeds the memory of the server since there can be multiple transactions per customer. I am working on a Windows server with 16 cores and 64 GB RAM, which I understand is a very limited spec for this type of work.
Methodology
Read the big dataset into a dask dataframe and set the index to be the customer ID. Read the 3.6GB dataset in pandas and set the index to be the customer ID. Launch a local cluster, with the parameter memory_limit='50GB' and processes=False.
Merge the dask dataframe with the pandas dataframe on index (left_index=True, right_index=True).
This method creates 75,000 tasks, which eventually blow up the memory.
Is what I am trying to accomplish possible? Have I chosen the wrong tools for it? I am running out of ideas and desperately need some help.
Yes, what you want to do is possible, but you may need to play with partition sizes a bit. If there is a lot of repetition in your data, then the merged partitions may suddenly become very large. You can address this by:
Using smaller partitions (maybe)
Reducing the amount of parallelism (perhaps try dask.config.set(scheduler="single-threaded") to see if that helps)
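A rough sketch of how those two knobs might be applied; the file paths, blocksize, and column name are assumptions:

import dask
import dask.dataframe as dd
import pandas as pd

# smaller partitions for the 45 GB transactions data
transactions = dd.read_csv("transactions_*.csv", blocksize="64MB")
transactions = transactions.set_index("customer_id")          # the common column

customers = pd.read_csv("customers.csv").set_index("customer_id")

merged = transactions.merge(customers, left_index=True, right_index=True, how="left")

# lower parallelism so fewer partitions are held in memory at once
with dask.config.set(scheduler="single-threaded"):
    merged.to_parquet("merged_output/")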