I am planning to use pandas HDFStore as a temporary file for out-of-core CSV operations
(CSV → HDFStore → out-of-core operation in pandas).
Just wondering:
The practical size limit of HDF5 for real-life usage on one machine
(not the theoretical one).
The cost of pivot-table operations (100 columns, fixed VARCHAR, numerical).
Whether I would need to switch to Postgres (load the CSV into Postgres) and do database stuff...
I tried to find some benchmarks on Google (HDF5 size limit vs. computation time) but could not find any.
The total size of the CSV is around 500 GB to 1 TB (uncompressed).
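A minimal sketch of the CSV → HDFStore pattern, assuming hypothetical file names ("input.csv", "scratch.h5") and a tiny chunksize for illustration; `min_itemsize` reserves a fixed width for the string column, which matters for the fixed-VARCHAR case:

```python
import pandas as pd

# Toy stand-in for a large CSV (file names here are illustrative).
pd.DataFrame({"key": ["a", "b", "a", "c"],
              "value": [1.0, 2.0, 3.0, 4.0]}).to_csv("input.csv", index=False)

with pd.HDFStore("scratch.h5", mode="w") as store:
    # format="table" supports appends and where= queries; min_itemsize
    # reserves a fixed width for the string column so later chunks with
    # longer strings still fit.
    for chunk in pd.read_csv("input.csv", chunksize=2):
        store.append("data", chunk, format="table",
                     data_columns=["key"], min_itemsize={"key": 10})

    # Out-of-core-style read-back: only matching rows come into memory.
    subset = store.select("data", where="key == 'a'")

print(len(subset))
```

Whether this stays practical at the 500 GB to 1 TB scale is exactly the open question; the pattern itself is just chunked appends plus indexed `where=` selects.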
Related
In my Spark job, I'm reading a huge Parquet table with more than 30 columns. To limit the amount of data read, I specify a schema with only one column (I need only that one). Unfortunately, the Spark UI reports that the size of files read equals 1123.8 GiB, while the total filesystem read size equals 417.0 GiB. I was expecting that if I take one of 30 columns, the total filesystem read size would be around 1/30 of the initial size, not almost half.
Could you explain to me why that is happening?
I have a two-part question about Dask+Parquet. I am trying to run queries on a Dask dataframe created from a partitioned Parquet file, like so:
import numpy as np
import pandas as pd
import dask.dataframe as dd
import fastparquet
##### Generate random data to simulate a process creating a Parquet file #####
test_df = pd.DataFrame(data=np.random.randn(10000, 2), columns=['data1', 'data2'])
test_df['time'] = pd.bdate_range('1/1/2000', periods=test_df.shape[0], freq='1S')
# some grouping column
test_df['name'] = np.random.choice(['jim', 'bob', 'jamie'], test_df.shape[0])
##### Write to partitioned parquet file, hive and simple #####
fastparquet.write('test_simple.parquet', data=test_df, partition_on=['name'], file_scheme='simple')
fastparquet.write('test_hive.parquet', data=test_df, partition_on=['name'], file_scheme='hive')
# now check partition sizes. Only Hive version works.
assert test_df.name.nunique() == dd.read_parquet('test_hive.parquet').npartitions # works.
assert test_df.name.nunique() == dd.read_parquet('test_simple.parquet').npartitions # !!!!FAILS!!!
My goal here is to be able to quickly filter and process individual partitions in parallel using dask, something like this:
df = dd.read_parquet('test_hive.parquet')
df.map_partitions(<something>) # operate on each partition
I'm fine with using the Hive-style Parquet directory, but I've noticed it takes significantly longer to operate on compared to directly reading from a single parquet file.
Can someone tell me the idiomatic way to achieve this? Still fairly new to Dask/Parquet so apologies if this is a confused approach.
Maybe it wasn't clear from the docstring, but partitioning by value simply doesn't happen for the "simple" file type, which is why it only has one partition.
As for speed, reading the data in one single function call is fastest when the data are so small - especially if you intend to do any operation such as nunique which will require a combination of values from different partitions.
In Dask, every task incurs an overhead, so unless the amount of work being done by the call is large compared to that overhead, you can lose out. In addition, disk access is not generally parallelisable, and some parts of the computation may not be able to run in parallel in threads if they hold the GIL. Finally, the partitioned version contains more parquet metadata to be parsed.
For example, compare the number of tasks in the two graphs:
>>> len(dd.read_parquet('test_hive.parquet').name.nunique().dask)
12
>>> len(dd.read_parquet('test_simple.parquet').name.nunique().dask)
6
TL;DR: make sure your partitions are big enough to keep dask busy.
(note: the set of unique values is already apparent from the parquet metadata, it shouldn't be necessary to load the data at all; but Dask doesn't know how to do this optimisation since, after all, some of the partitions may contain zero rows)
We use the BigQuery Java API to upload data from a local data source as described here. When uploading a Parquet file with 18 columns (16 string, 1 float64, 1 timestamp) and 13 million rows (about 17 GB of data), the upload fails with the following exception:
Resources exceeded during query execution: UDF out of memory.; Failed
to read Parquet file . This might happen if the file contains a row
that is too large, or if the total size of the pages loaded for the
queried columns is too large.
However, when uploading the same data as CSV (17.5 GB of data), the upload succeeds. My questions are:
What is the difference when uploading Parquet or CSV?
What query is executed during upload?
Is it possible to increase the memory for this query?
Thanks
Tobias
Parquet is a columnar data format, which means that loading data requires reading all columns. In Parquet, columns are divided into pages. BigQuery keeps entire uncompressed pages for each column in memory while reading data from them. If the input file contains too many columns, BigQuery workers can hit out-of-memory errors.
Even though a precise limit is not enforced, as happens with other formats, it is recommended that records stay in the range of 50 MB; loading larger records may lead to resourcesExceeded errors.
Taking into account the above considerations, it would be great to clarify the following points:
What is the maximum size of rows in your Parquet file?
What is the maximum page size per column?
This info can be retrieved using publicly available tools.
If you are thinking about increasing the memory allocated to queries, you need to read about BigQuery slots.
In my case, I ran bq load --autodetect --source_format=PARQUET ... which failed with the same error (resources exceeded during query execution). Finally, I had to split the data into multiple Parquet files so that they would be loaded in batches.
I'm using PySpark to process some data and write the output to S3. I have created a table in Athena which will be used to query this data.
The data is in the form of JSON strings (one per line); the Spark code reads the file, partitions it based on certain fields, and writes to S3.
For a 1.1 GB file, I see that Spark writes 36 files of approximately 5 MB each. Reading the Athena documentation, I see that the optimal file size is ~128 MB: https://aws.amazon.com/blogs/big-data/top-10-performance-tuning-tips-for-amazon-athena/
sparkSess = SparkSession.builder\
    .appName("testApp")\
    .config("spark.debug.maxToStringFields", "1000")\
    .config("spark.sql.sources.partitionOverwriteMode", "dynamic")\
    .getOrCreate()
sparkCtx = sparkSess.sparkContext
deltaRdd = sparkCtx.textFile(filePath)
df = sparkSess.createDataFrame(deltaRdd, schema)
try:
    df.write.partitionBy('field1', 'field2', 'field3')\
        .json(path, mode='overwrite', compression=compression)
except Exception as e:
    print(e)
Why is Spark writing such small files? Is there any way to control the file size?
Is there any way to control file size?
There are some control mechanisms; however, they are not explicit.
The S3 drivers are not part of Spark itself. They are part of the Hadoop installation which ships with Spark on EMR. The S3 block size can be set in the
/etc/hadoop/core-site.xml config file.
By default, however, it should be around 128 MB.
why spark is writing such smaller files
Spark will adhere to the Hadoop block size. However, you can use partitionBy before writing.
Let's say you use partitionBy("date").write.csv("s3://products/").
Spark will create a subfolder with the date for each partition. Within
each partitioned folder, Spark will again create chunks and try to adhere to fs.s3a.block.size.
e.g.
s3://products/date=20191127/00000.csv
s3://products/date=20191127/00001.csv
s3://products/date=20200101/00000.csv
In the example above, a particular partition can simply be smaller than the 128 MB block size.
So double check your block size in /etc/hadoop/core-site.xml and whether you need to partition the data frame with partitionBy before writing.
Edit:
A similar post also suggests repartitioning the dataframe to match the partitionBy scheme:
df.repartition('field1', 'field2', 'field3')
    .write.partitionBy('field1', 'field2', 'field3')
writer.partitionBy operates on the existing dataframe partitions; it will not repartition the original dataframe. Hence, if the overall dataframe is partitioned differently, nested partitioning happens.
I have three huge data frames, 40 GB each, which I opened using chunks. Now I want to concatenate them together. Here is what I tried:
path = 'path/to/myfiles'
files = [os.path.join(path,i) for i in os.listdir(path) if i.endswith('tsv')]
for file in files:
    cols = ['col1', 'col2', 'col3']
    chunks = pd.read_table(file, sep='\t', names=cols, chunksize=10000000)
However, when I try to concatenate all the files, it takes forever.
I would like some suggestions on how to concatenate all the data frames more quickly.
CSV/TSV is a very slow, unoptimized file format.
You probably don't need to keep the entire dataset in-memory. Your use-case probably doesn't need full random column- and row-access across the entire combined (120GB) dataset.
Can you process each row/chunk/group (e.g. by zipcode, user_id, etc.) serially, for instance to compute aggregates, summary statistics, or features? Or do you need to apply arbitrary filters across columns (which columns?) or rows (which rows?), e.g. "Get all user IDs who used service X within the last N days"? You can choose a higher-performance file format based on your use case. There are alternative file formats (HDF5, Parquet, etc.); some are optimized for columnar access or row access, some for sequential or random access. There is also PySpark.
You don't necessarily need to combine your dataset into one huge monolithic 120GB file.
You say the runtime is slow, but likely you're blowing out memory (in which case runtime goes out the window), so first check your memory usage.
Your code reads in and stores all chunks of each file, instead of processing them individually chunk-by-chunk across the three files: for file in files: ... chunks = pd.read_table(file, ..., chunksize=10000000). See "Iterating through files chunk by chunk, in pandas".
After you fix that: the chunksize=1e7 parameter is not the memory size of the chunk; it's only the number of rows in the chunk, and that value is insanely large. If one row of the combined dataframes took, say, 10 KB, then a chunk of 1e7 such rows would take 100 GB(!), which will not fit on most machines.
If you must stick with CSV, process one single chunk across each of the three files, then write its output to file; don't leave all the chunks hanging around in memory. Also reduce your chunksize (try e.g. 1e5 or less, and measure the memory and runtime improvement). Don't hardcode it: figure out a sane value per machine, and/or make it a command-line parameter. Monitor your memory usage.
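A sketch of that pattern with toy stand-ins for the three TSVs (the file names, column names, and tiny chunksize are illustrative): each chunk is written out as soon as it is read, so at most one chunk is in memory at a time.

```python
import pandas as pd

# Toy stand-ins for the three large TSVs (no header, three columns).
cols = ["col1", "col2", "col3"]
for i in range(3):
    pd.DataFrame({"col1": range(4), "col2": range(4), "col3": range(4)}).to_csv(
        f"part{i}.tsv", sep="\t", header=False, index=False)

files = [f"part{i}.tsv" for i in range(3)]
out = "combined.tsv"

# Stream each file chunk-by-chunk and append to the output file, writing
# the header only once.
first = True
for file in files:
    for chunk in pd.read_table(file, sep="\t", names=cols, chunksize=2):
        chunk.to_csv(out, sep="\t", mode="w" if first else "a",
                     header=first, index=False)
        first = False

combined = pd.read_table(out, sep="\t")
print(len(combined))
```

The same loop shape works for per-chunk processing (aggregation, filtering) before the write, instead of plain concatenation.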
.tsv and .csv are fairly slow formats to read/write. I've found parquet works best for most of the stuff I end up doing. It's quite fast on reads and writes, and also allows you to read back a chunked folder of files as a single table easily. It does require string column names, however:
In [102]: df = pd.DataFrame(np.random.random((100000, 100)), columns=[str(i) for i in range(100)])
In [103]: %time df.to_parquet("out.parquet")
Wall time: 980 ms
In [104]: %time df.to_csv("out.csv")
Wall time: 14 s
In [105]: %time df = pd.read_parquet("out.parquet")
Wall time: 195 ms
In [106]: %time df = pd.read_csv("out.csv")
Wall time: 1.53 s
If you don't have control over the format those chunked files are in, you'll obviously need to pay the read cost of them at least once, but converting them could still save you some time in the long run if you do a lot of other read/writes.