I've been tasked with automating access to a third-party vendor's API.
The third-party vendor wants data in the format:
data = open(fname, 'rb').read()
yet I have data in a pandas DataFrame. What is the easiest way to go from a DataFrame to this 'data' value?
I spent a long time on this and I literally cannot believe the best answer:
csv_string = df.to_csv()
Literally just omit the filename and to_csv will return its output as a string instead of writing to a file. It was in the documentation.
If your dataframe is foo, then
foo.to_csv() will work wonders.
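Since the vendor's snippet reads the file in binary mode, you may still need to encode that string to bytes. A minimal sketch, assuming UTF-8 is acceptable to the vendor and that dropping the index (index=False) is what you want:

import pandas as pd

df = pd.DataFrame({"a": [1, 2], "b": [3, 4]})
# to_csv() with no path returns the CSV as a str; encoding it gives the same
# kind of bytes object that open(fname, 'rb').read() would produce.
data = df.to_csv(index=False).encode("utf-8")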
I'm reading this arff file into a pandas dataframe in Colab. I've used the following code, which seems to be fairly standard, from what a quick scan of top search results tells me.
import pandas as pd
from scipy.io.arff import loadarff

raw_data = loadarff('/speeddating.arff')
df = pd.DataFrame(raw_data[0])
When I inspect the dataframe, many of the values appear in this format: b'some_text'.
When I call type(df.iloc[0,0]) it returns bytes.
What is happening, and how do I get it to not be that way?
If anyone else stumbles upon this question, I found it answered here: Letter appeared in data when arff loaded into Python
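For anyone who doesn't want to follow the link, one common fix (a sketch, not necessarily the exact code from that answer) is to decode the byte strings after loading, assuming the file is UTF-8 encoded:

import pandas as pd
from scipy.io.arff import loadarff

raw_data = loadarff('/speeddating.arff')
df = pd.DataFrame(raw_data[0])

# loadarff returns nominal attributes as bytes; decode every object column to str
str_cols = df.select_dtypes([object]).columns
df[str_cols] = df[str_cols].apply(lambda col: col.str.decode('utf-8'))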
I was looking for the best way to collect all fields of the same type in pandas, similar to the style that works with spark data frames, as below:
continuousCols = [c[0] for c in pysparkDf.dtypes if c[1] in ['int', 'double']]
I eventually figured it out with continuousCols = df.select_dtypes([float, int]).columns. If you have used any other method that works, feel free to add it as well.
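For reference, a small sketch of that approach next to the 'number' alias, which picks up every numeric dtype at once (the column names here are made up):

import pandas as pd

df = pd.DataFrame({"age": [25, 31], "income": [50000.0, 62000.5], "city": ["NY", "LA"]})

# explicit dtypes, mirroring the Spark-style filter above
continuousCols = df.select_dtypes([float, int]).columns.tolist()

# or let pandas select every numeric dtype (int8/16/32/64, float32/64, ...)
continuousCols = df.select_dtypes(include='number').columns.tolist()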
Coming from Python, I started using Julia for its speed in a big-data project. When reading data from .xlsx files, the datatype in each column is "any", despite most of the data being integers or floats.
Is there any Julia-way of inferring the datatypes in a DataFrame (like df = infertype.(df))?
This may be difficult in Julia, given the reduced flexibility on datatypes, but any tips on how to accomplish it would be appreciated. Assume, ex ante, that I do not know which column is which, but the types can only be int, float, string or date.
using DataFrames
using XLSX
df = DataFrame(XLSX.readtable("MyFile.xlsx", "Sheet1")...)
You can just do:
df = DataFrame(XLSX.readtable("MyFile.xlsx", "Sheet1"; infer_eltypes=true)...)
Additionally, it is worth knowing that typing ? before a command in the Julia REPL brings up its help, which can contain such information:
help?> XLSX.readtable
readtable(filepath, sheet, [columns]; [first_row], [column_labels], [header], [infer_eltypes], [stop_in_empty_row], [stop_in_row_function]) -> data, column_labels
Returns tabular data from a spreadsheet as a tuple (data, column_labels). (...)
(...)
Use infer_eltypes=true to get data as a Vector{Any} of typed vectors. The default value is infer_eltypes=false.
(...)
I need to save a bunch of PySpark DataFrames as csv tables. The tables should also have the same names as the DataFrames.
The code should be something like this:
for table in ['ag01','a5bg','h68chl', 'df42', 'gh63', 'ur55', 'ui99']:
    ppath='hdfs://hadoopcentralprodlab01/..../data/'+table+'.csv'
    table.repartition(1).write.mode("overwrite").format("com.databricks.spark.csv").option("header", "true").save(ppath)
The problem is that in the command "table.repartition(1)..." I need the actual data frames, not their names in quotes, so in this form the code doesn't work. But if I write "for table in [ag01, a5bg, ...]", i.e. without quotes in the list, I then cannot define the path, because I cannot concatenate a data frame and a string. How can I resolve this dilemma?
Thanks in advance!
Having a bunch of loose variable names is not considered good coding practice. You should have used a list or a dictionary in the first place (a dictionary-based alternative is sketched after the loop below). But if you're stuck with this already, you can use eval to get the dataframe stored in that variable.
for table in ['ag01', 'a5bg', 'h68chl', 'df42', 'gh63', 'ur55', 'ui99']:
    ppath = 'hdfs://hadoopcentralprodlab01/..../data/' + table + '.csv'
    df = eval(table)  # look up the DataFrame object bound to this variable name
    df.repartition(1).write.mode("overwrite").format("com.databricks.spark.csv").option("header", "true").save(ppath)
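If you can change how the DataFrames are created, a dictionary keyed by the name you want on disk avoids eval entirely. A minimal sketch, with hypothetical sources standing in for however you actually build ag01, a5bg, and the rest:

# build (or collect) the DataFrames into a dict keyed by their output name
tables = {
    'ag01': spark.read.parquet('/some/source/ag01'),  # hypothetical sources
    'a5bg': spark.read.parquet('/some/source/a5bg'),
}

for name, sdf in tables.items():
    ppath = 'hdfs://hadoopcentralprodlab01/..../data/' + name + '.csv'
    sdf.repartition(1).write.mode("overwrite").format("com.databricks.spark.csv").option("header", "true").save(ppath)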
I am trying to do the following:
Read in a .dat file with pandas, convert it to a dask dataframe, concatenate it to another dask dataframe that I read in from a parquet file, and then output to a new parquet file. I do the following:
import dask.dataframe as dd
import pandas as pd
hist_pth = r"\path\to\hist_file"
hist_file = dd.read_parquet(hist_pth)
pth = r"\path\to\file"
daily_file = pd.read_csv(pth, sep="|", encoding="latin")
daily_file = daily_file.astype(hist_file.dtypes.to_dict(), errors="ignore")
dask_daily_file = dd.from_pandas(daily_file, npartitions=1)
combined_file = dd.concat([dask_daily_file, hist_file])
output_path = r"\path\to\output"
combined_file.to_parquet(output_path)
The combined_file.to_parquet(output_path) call always starts and then stalls or doesn't work correctly. In a Jupyter notebook I get a kernel failure when I do this. When I run it as a Python script, the script completes but the whole combined file isn't written (I know because of the size: the CSV is 140MB and the parquet file is around 1GB, yet the output of to_parquet is only 20MB).
Some context: this is for an ETL process, and with the amount of data we're adding daily I'm soon going to run out of memory on the historical and combined datasets, so I'm trying to migrate the process from plain pandas to Dask to handle the larger-than-memory data I will soon have. The current data, daily + historical, still fits in memory, but just barely (I already make use of categoricals; these are stored in the parquet file and then I copy that schema to the new file).
I also noticed that after the dd.concat([dask_daily_file, hist_file]) I am unable to call .compute() even on simple tasks without it crashing the same way it does when writing to parquet. For example, on the original, pre-concatenated data I can call hist_file["Value"].div(100).compute() and get the expected value, but the same method on combined_file crashes. Even just combined_file.compute() to turn it into a pandas df crashes. I have tried repartitioning as well, with no luck.
I was able to do these exact operations, just in pandas, without issue. But again, I'm going to be running out of memory soon which is why I am moving to dask.
Is this something dask isn't able to handle? If it can handle it, am I processing it correctly? Specifically, it seems like the concat is causing issues. Any help appreciated!
UPDATE
After playing around more I ended up with the following error:
AttributeError: 'numpy.ndarray' object has no attribute 'categories'
There is an existing GitHub issue that seems like it could be related to this - I asked and am waiting for confirmation.
As a workaround I converted all categorical columns to strings/objects and tried again, and then ended up with:
ArrowTypeError: ("Expected a bytes object, got a 'int' object", 'Conversion failed for column Account with type object')
When I check that column with df["Account"].dtype it returns dtype('O'), so I think I already have the correct dtype. The values in this column are mainly numbers, but there are some records with just letters.
Is there a way to resolve this?
I got this error in Pandas after concatenating dataframes and saving the result to Parquet format..
data = pd.concat([df_1, df_2, df_3], axis=0, ignore_index=True)
data.to_parquet(filename)
..apparently because the rows contained different data types, either int or float. Forcing the affected columns to the same data type before saving makes the error go away:
cols = ["first affected col", "second affected col", ..]
data[cols] = data[cols].apply(pd.to_numeric, errors='coerce', axis=1)
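In the asker's case, where the Account column mixes numbers with letter-only records, pd.to_numeric with errors='coerce' would turn the letter values into NaN. If those values need to be kept, casting the whole column to string before writing is the other way to force a single type; a small sketch under that assumption:

# keep the letter-only records by making the whole column strings
combined_file["Account"] = combined_file["Account"].astype(str)
combined_file.to_parquet(output_path)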