Concat Pandas DF with CSV File - pandas

I want to concat 2 dataframes into one df and save it as a single CSV, considering that the first dataframe already lives in a CSV file and is huge, so I don't want to load it into memory. I tried df.to_csv with append mode, but it doesn't behave like pd.concat with regard to differing columns (comparing and combining columns). Does anyone know how to concat a CSV and a df? The CSV and the df can have different columns, so the output CSV should have a single header containing all columns, with the rows lined up accordingly.

You can use Dask DataFrame to do this operation lazily. It will still load your data into memory, but in small chunks. Make sure to keep the partition size (blocksize) reasonable, based on your overall memory capacity.
import dask.dataframe as dd
ddf1 = dd.read_csv("data1.csv", blocksize=25e6)
ddf2 = dd.read_csv("data2.csv", blocksize=25e6)
new_ddf = dd.concat([ddf1, ddf2])
new_ddf.to_csv("combined_data.csv")
API docs: read_csv, concat, to_csv
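One note on the output: by default Dask writes one CSV file per partition, so the call above produces a directory of part files. If a single CSV is required, as in the question, recent Dask versions accept a single_file flag on to_csv, along the lines of:
# write everything into one CSV file instead of one file per partition
new_ddf.to_csv("combined_data.csv", single_file=True)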

Related

Save columns from Pandas dataframe to csv, without reading the csv file

I was wondering whether there is a way to append columns from a dataframe to an already existing CSV file without reading the entire file first?
I am working with a very large dataset, where I read 2-5 columns of the dataset, use them to calculate a new variable (column), and then want to store this variable alongside the full dataset. My machine cannot hold the entire dataset in memory at once, so I am looking for a way to add the new column to the full dataset without loading all of it.
I have tried chunking with:
df = pd.read_csv(Path, chunksize = 10000000)
But then I am faced with the error "TypeError: 'TextFileReader' object is not subscriptable" when trying to process the data.
The data is also grouped by two variables, so chunking is not well suited to those calculations.
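For context on that error: with chunksize, read_csv returns an iterator of dataframes (a TextFileReader), not something you can index into, so each chunk has to be consumed in a loop. A minimal sketch of that pattern, with a hypothetical calculation on two illustrative columns 'a' and 'b' and the result appended to a new CSV:
import pandas as pd

# chunksize makes read_csv return an iterator of DataFrames, not a DataFrame
reader = pd.read_csv("big_dataset.csv", usecols=["a", "b"], chunksize=1_000_000)

for i, chunk in enumerate(reader):
    chunk["new_var"] = chunk["a"] * chunk["b"]   # hypothetical calculation
    # append chunk by chunk, writing the header only for the first chunk
    chunk[["new_var"]].to_csv("new_column.csv", mode="a", header=(i == 0), index=False)
As noted in the question, this does not help when the calculation needs whole groups at once, but it shows the intended way to consume a TextFileReader.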

Save memory with big pandas dataframe

I have a huge pandas dataframe: 42 columns, 19 million rows and mixed dtypes. I load this dataframe from a CSV file into JupyterLab, do some operations on it (adding more columns), and write it back to a CSV file. A lot of the columns are int64, and in some of these columns many rows are empty.
Do you know a technique or a specific dtype that I can apply to the int64 columns to reduce the size of the dataframe, save memory, and shrink the resulting CSV file?
Could you provide an example?
[For columns containing only strings I have already changed the dtype to 'category'.]
Thank you.
If I understand your question correctly, the issue is the size of the csv file when you write it back to disk.
A csv file is just a text file, and as such the columns aren't stored with dtypes. It doesn't matter what you change your dtype to in pandas, it will be written back as characters. This makes csv very inefficient for storing large amounts of numerical data.
If you don't need it as a csv for some other reason, try a different file type such as parquet. (I have found this to reduce my file size by 10x, but it depends on your exact data.)
If you're specifically looking to convert dtypes, see this question, but as mentioned, this won't help your csv file size: Change column type in pandas
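A minimal sketch of both suggestions, assuming a hypothetical file and column names, and that pyarrow or fastparquet is installed for the parquet step; the nullable Int64 dtype keeps empty rows as <NA> instead of forcing the column to float64:
import pandas as pd

df = pd.read_csv("big_file.csv")  # hypothetical file name

# Integer columns with empty rows come back as float64 (NaN forces the cast);
# the nullable Int64 dtype keeps them as integers with <NA> for missing values.
for col in ["col_a", "col_b"]:          # hypothetical column names
    df[col] = df[col].astype("Int64")   # Int32 / Int16 also exist if the range allows

# Parquet stores dtypes and compresses the data, unlike csv
df.to_parquet("big_file.parquet", index=False)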

How to generate, then reduce, a massive set of DataFrames from each row of one DataFrame in PySpark?

I cannot share my actual code or data, unfortunately, as it is proprietary, but I can produce a MWE if the problem isn't clear to readers from the text.
I am working with a dataframe containing ~50 million rows, each of which contains a large XML document. From each XML document, I extract a list of statistics relating to the number of occurrences and hierarchical relationships between tags (nothing like undocumented XML formats to brighten one's day). I can express these statistics in dataframes, and I can combine these dataframes over multiple documents using standard operations like GROUP BY/SUM and DISTINCT. The goal is to extract the statistics for all 50 million documents and express them in a single dataframe.
The problem is that I don't know how to efficiently generate 50 million dataframes from each row of one dataframe in Spark, or how to tell Spark to reduce a list of 50 million dataframes to one dataframe using binary operators. Are there standard functions that do these things?
So far, the only workaround I have found is massively inefficient (storing the data as a string, parsing it, doing the computations, and then converting it back into a string). It would take weeks to finish using this method, so it isn't practical.
The extractions and statistical data from each XML response can be stored in additional columns of that same row. That way Spark can distribute the work across its executors, which improves performance.
Here is some pseudocode.
from pyspark.sql import Row
from pyspark.sql.types import (StructType, StructField, IntegerType,
                               StringType, DateType, FloatType, ArrayType)

def extract_metrics_from_xml(row):
    j = row['xmlResponse']  # assuming your xml column name is xmlResponse
    # perform your xml extractions and computations for the xmlResponse in Python
    ...
    load_date = ...
    stats_data1 = ...
    stats_data2 = ...
    stats_group = ...
    return Row(load_date, stats_data1, stats_data2, stats_group)

schema = StructType([StructField('load_date', DateType()),
                     StructField('stats_data1', FloatType()),
                     StructField('stats_data2', ArrayType(IntegerType())),
                     StructField('stats_group', StringType())])

df_with_xml_stats = original_df.rdd \
    .map(extract_metrics_from_xml) \
    .toDF(schema=schema, sampleRatio=1) \
    .cache()
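Once the per-document statistics live in columns, the final reduction over all 50 million documents can be expressed with ordinary DataFrame aggregations instead of 50 million separate dataframes. A rough sketch, reusing the hypothetical column names from the pseudocode above:
from pyspark.sql import functions as F

# Illustrative only: fold the per-document statistics into one summary dataframe,
# grouping on the hypothetical 'stats_group' column
summary_df = (df_with_xml_stats
              .groupBy("stats_group")
              .agg(F.sum("stats_data1").alias("total_stats_data1"),
                   F.countDistinct("load_date").alias("distinct_load_dates")))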

How to write filenames based on a dask dataframe column?

I have a dask dataframe that I would like to save to S3. The dataframe has a "timestamp" column. I would like to partition the paths in S3 based on the dates in that timestamp column, so that the output in S3 looks like this:
s3://....BUCKET_NAME/data/date=2019-01-01/part1.json.gz
s3://....BUCKET_NAME/data/date=2019-01-01/part2.json.gz
...
...
s3://....BUCKET_NAME/data/date=2019-05-01/part1.json.gz
Is this possible in dask? I can only find the name_function argument for the output, which expects an integer as input, and setting the column as the index doesn't add the index to the output filenames.
It's actually easy to achieve, as long as you are happy to save it as parquet, using partition_on. You should rename your folder from data to data.parquet if you want to read with dask.
df.to_parquet("s3://BUCKET_NAME/data.parquet/", partition_on=["timestamp"])
Not sure if it's the only or optimal way, but you should be able to do it with groupby-apply, as in:
df.groupby('timestamp').apply(write_partition)
where write_partition is a function that takes a pandas dataframe for a single timestamp and writes it to S3. Make sure you check the docs of apply, as there are some gotchas (you need to provide meta, there is a full shuffle if the groupby column is not in the index, and the function is called once per partition-group pair rather than once per group).
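A rough sketch of such a function, assuming the timestamp values are date-like, that s3fs is installed so pandas can write to S3 paths, and that BUCKET_NAME is a placeholder; the random part name guards against the function being called more than once per group:
import uuid
import pandas as pd

def write_partition(pdf):
    # pdf is a pandas DataFrame holding (part of) the rows for one timestamp value
    date = pd.Timestamp(pdf["timestamp"].iloc[0]).date()
    path = f"s3://BUCKET_NAME/data/date={date}/part-{uuid.uuid4().hex}.json.gz"
    pdf.to_json(path, orient="records", lines=True, compression="gzip")
    return 0

# meta describes the (unused) return value so dask can build the task graph
df.groupby("timestamp").apply(write_partition, meta=("written", "int64")).compute()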

pandas read_sql not reading all rows

I am running the exact same query both through pandas' read_sql and through an external app (DbVisualizer).
DbVisualizer returns 206 rows, while pandas returns 178.
I have tried reading the data from pandas in chunks, based on the information provided at How to create a large pandas dataframe from an sql query without running out of memory?, but it didn't make a difference.
What could be the cause of this, and how can it be remedied?
The query:
select *
from rainy_days
where year='2010' and day='weekend'
The columns contain: date, year, weekday, amount of rain on that day, temperature, geo_location (one row per location), wind measurements, amount of rain the day before, etc.
The exact python code (minus connection details) is:
import pandas
from sqlalchemy import create_engine
engine = create_engine(
'postgresql://user:pass@server.com/weatherhist?port=5439',
)
query = """
select *
from rainy_days
where year='2010' and day='weekend'
"""
df = pandas.read_sql(query, con=engine)
https://github.com/xzkostyan/clickhouse-sqlalchemy/issues/14
If you use plain engine.execute, you have to take care of the result format manually.
The problem is that pandas returns a packed dataframe (DF). For some reason this is always on by default and the results varies widely as to what is shown. The solution is to use the unpacking operator (*) before/when trying to print the df, like this:
print(*df)
(This is also known as the splat operator for Ruby enthusiasts.)
To read more about this, please check out these references & tutorials:
https://treyhunner.com/2018/10/asterisks-in-python-what-they-are-and-how-to-use-them/
https://www.geeksforgeeks.org/python-star-or-asterisk-operator/
https://medium.com/understand-the-python/understanding-the-asterisk-of-python-8b9daaa4a558
https://towardsdatascience.com/unpacking-operators-in-python-306ae44cd480
It's not a fix, but what worked for me was to rebuild the indices:

1. Drop the indices
2. Export the whole table to a csv
3. Delete all the rows:
DELETE FROM table
4. Import the csv back in
5. Rebuild the indices

With pandas:
df = read_csv(..)
df.to_sql(..)

If that works, then at least you know you have a problem somewhere with the indices keeping up to date.
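For reference, a minimal sketch of that pandas round trip, with hypothetical file, table, and connection names, since the answer above leaves those details out:
import pandas as pd
from sqlalchemy import create_engine

# hypothetical connection string, table and file names
engine = create_engine("postgresql://user:pass@server.example.com:5439/weatherhist")

df = pd.read_csv("rainy_days_export.csv")   # the previously exported table
df.to_sql("rainy_days", con=engine,         # write it back into the emptied table
          if_exists="append", index=False)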