I'm reading an .arff file into a pandas DataFrame in Colab. I've used the following code, which seems fairly standard from what a quick scan of the top search results tells me.
from scipy.io.arff import loadarff
import pandas as pd

raw_data = loadarff('/speeddating.arff')
df = pd.DataFrame(raw_data[0])
When I inspect the dataframe, many of the values appear in this format: b'some_text'.
When I call type(df.iloc[0,0]) it returns bytes.
What is happening, and how do I get plain strings instead?
If anyone else stumbles upon this question, I found it answered here: Letter appeared in data when arff loaded into Python
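For completeness: loadarff returns nominal/string attributes as byte strings, which is why the values show up as b'some_text'. Below is a minimal sketch of one way to decode them after loading; selecting the byte columns via select_dtypes is an assumption about how they surface in the frame, so adjust as needed.
from scipy.io.arff import loadarff
import pandas as pd

raw_data = loadarff('/speeddating.arff')
df = pd.DataFrame(raw_data[0])

# Nominal ARFF attributes arrive as bytes in object-dtype columns; decode them to str
for col in df.select_dtypes([object]).columns:
    df[col] = df[col].str.decode('utf-8')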
I was looking for the best way to collect all columns of the same type in pandas, similar to the style that works with Spark DataFrames, as below:
continuousCols = [c[0] for c in pysparkDf.dtypes if c[1] in ['int', 'double']]
I eventually figured it out with continuousCols = df.select_dtypes([float, int]).columns. If you have used any other method that works, feel free to add it as well.
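A small self-contained sketch of how that reads in practice (the column names and values below are made up for illustration):
import pandas as pd

df = pd.DataFrame({
    "age": [25, 32, 40],                     # int column
    "income": [50000.0, 64000.5, 72000.0],   # float column
    "city": ["NYC", "LA", "SF"],             # non-numeric column, excluded
})

continuousCols = df.select_dtypes([float, int]).columns.tolist()
# continuousCols == ['age', 'income']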
I am trying to do the following:
Read in a .dat file with pandas, convert it to a Dask DataFrame, concatenate it with another Dask DataFrame that I read in from a Parquet file, and then output to a new Parquet file. I do the following:
import dask.dataframe as dd
import pandas as pd

# Historical data already stored as Parquet
hist_pth = r"\path\to\hist_file"
hist_file = dd.read_parquet(hist_pth)

# Daily pipe-delimited file read with pandas
pth = r"\path\to\file"
daily_file = pd.read_csv(pth, sep="|", encoding="latin")

# Align the daily file's dtypes with the historical schema, then convert to Dask
daily_file = daily_file.astype(hist_file.dtypes.to_dict(), errors="ignore")
dask_daily_file = dd.from_pandas(daily_file, npartitions=1)

combined_file = dd.concat([dask_daily_file, hist_file])

output_path = r"\path\to\output"
combined_file.to_parquet(output_path)
The combined_file.to_parquet(output_path) call always starts and then stops, or doesn't work correctly. In a Jupyter notebook I get a kernel failure error when I do this. When I run it as a Python script, the script completes but the whole combined file isn't written (I can tell from the size: the CSV is 140 MB and the Parquet file is around 1 GB, yet the output of to_parquet is only 20 MB).
Some context: this is for an ETL process, and with the amount of data we're adding daily I'm soon going to run out of memory on the historical and combined datasets, so I'm trying to migrate the process from plain pandas to Dask to handle the larger-than-memory data I will soon have. The current data, daily + historical, still fits in memory, but just barely (I already make use of categoricals; these are stored in the Parquet file, and then I copy that schema to the new file).
I also noticed that after the dd.concat([dask_daily_file, hist_file]) I am unable to call .compute() even on simple tasks without it crashing the same way it does when writing to Parquet. For example, on the original, pre-concatenated data I can call hist_file["Value"].div(100).compute() and get the expected value, but the same method on combined_file crashes. Even just combined_file.compute() to turn it into a pandas DataFrame crashes. I have tried repartitioning as well, with no luck.
I was able to do these exact operations, just in pandas, without issue. But again, I'm going to be running out of memory soon which is why I am moving to dask.
Is this something dask isn't able to handle? If it can handle it, am I processing it correctly? Specifically, it seems like the concat is causing issues. Any help appreciated!
UPDATE
After playing around more I ended up with the following error:
AttributeError: 'numpy.ndarray' object has no attribute 'categories'
There is an existing GitHub issue that seems like it could be related to this; I asked and am waiting for confirmation.
As a workaround I converted all categorical columns to strings/objects and tried again, and then ended up with:
ArrowTypeError: ("Expected a bytes object, got a 'int' object, 'Conversion failed for column Account with type object')
When I check that column with df["Account"].dtype it returns dtype('O'), so I think I already have the correct dtype. The values in this column are mainly numbers, but there are some records with just letters.
Is there a way to resolve this?
I got this error in pandas after concatenating DataFrames and saving the result to Parquet format...
data = pd.concat([df_1, d2, df3], axis=0, ignore_index=True)
data.to_parquet(filename)
...apparently because the rows contained different data types, either int or float. Forcing the affected columns to the same data type before saving makes the error go away:
cols = ["first affected col", "second affected col", ..]
data[cols] = data[cols].apply(pd.to_numeric, errors='coerce', axis=1)
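For the Account column described in the update, where a single object column holds a mix of numbers and letters, the same idea works in the other direction: force everything to one type before writing. A minimal sketch with a stand-in frame (astype(str) is one option; pd.to_numeric with errors='coerce' would instead turn the letter-only records into NaN):
import pandas as pd

# Hypothetical mix of int and str values in one object-dtype column
df = pd.DataFrame({"Account": [1001, 1002, "ABC", 1003]})

# pyarrow rejects mixed int/str object columns; make them all strings
df["Account"] = df["Account"].astype(str)

df.to_parquet("accounts.parquet")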
I have a DataFrame containing a single column with a list of file names. I want to find all rows in the DataFrame whose value starts with a prefix from a set of known prefixes.
I know I could run a simple for loop, but I want to do it with DataFrame operations to check speeds and run benchmarks; it's also a nice exercise.
What I had in mind is combining str.slice with str.index, but I can't get it to work. This is what I tried:
import pandas as pd

file_prefixes = {...}
file_df = pd.DataFrame(list_of_file_names, columns=["file"])

file_df.loc[file_df.file.str.slice(start=0, stop=file_df.file.str.index('/') - 1).isin(file_prefixes), :]  # this doesn't work, as str.index returns a Series rather than a scalar
My hope is that this code will return all rows whose value starts with a file prefix from the set above.
In summary, I would like help with 2 things:
Combining slice and index
Thoughts about better ways to achieve this
Thanks
I would use startswith:
file_df.loc[file_df.file.str.startswith(tuple(file_prefixes)), :]
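A small self-contained example of how this behaves (file names and prefixes are made up for illustration; startswith accepts a tuple, hence the tuple(...) conversion of the set):
import pandas as pd

file_prefixes = {"logs/", "images/"}  # hypothetical known prefixes
file_df = pd.DataFrame({"file": ["logs/app.log", "images/cat.png", "tmp/scratch.txt"]})

matches = file_df.loc[file_df.file.str.startswith(tuple(file_prefixes)), :]
# keeps 'logs/app.log' and 'images/cat.png'; 'tmp/scratch.txt' is dropped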
I am completely new to coding and have started to experiment with Python and pandas. Quite an adventure, and I am learning a lot. I found a lot of solutions already here on Stack Overflow, but not for my latest quest.
With pandas I imported and edited a txt file in such a way that I could export it as a CSV file. But to be able to import this CSV file into another program, I need the header row to start on row number 20. So I actually need 19 empty rows above it.
Can somebody guide me in the right direction?
You can join your dataframe with an empty dataframe:
empty_data = [[''] * len(df.columns) for i in range(19)]
empty_df = pd.DataFrame(empty_data, columns=df.columns)
pd.concat((df, empty_df))
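If the requirement is literally that the header line lands on row 20 of the CSV file, another option is to write 19 blank lines first and then append the normal CSV output below them. A minimal sketch, assuming the frame is df and a hypothetical output name output.csv:
import pandas as pd

df = pd.DataFrame({"a": [1, 2], "b": [3, 4]})  # stand-in for the edited data

with open("output.csv", "w", newline="") as f:
    f.write("\n" * 19)           # 19 empty rows before the header
    df.to_csv(f, index=False)    # header lands on row 20, data from row 21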
I've been tasked with automating access to an API of a third party vendor.
The third-party vendor wants data in the format:
data = open(fname, 'rb').read()
yet I have data in a pandas DataFrame. What is the easiest way to go from a DataFrame to this 'data' value?
I spent a long time on this and I can literally not believe the best answer:
csv_string = df.to_csv()
Just omit the filename and to_csv returns the output as a string instead of writing to a file. It's in the documentation.
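One caveat: the vendor's example reads the file in binary mode ('rb'), so it is expecting bytes rather than a str, and the string from to_csv may still need encoding. A short sketch of that last step (the utf-8 choice is an assumption; match whatever the vendor requires):
import pandas as pd

df = pd.DataFrame({"a": [1, 2], "b": ["x", "y"]})  # stand-in for your DataFrame

csv_string = df.to_csv(index=False)  # no filename, so to_csv returns a str
data = csv_string.encode("utf-8")    # bytes, like open(fname, 'rb').read()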
If your dataframe is foo, then
foo.to_csv('filename') will work wonders.
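And if you do write to disk as above, the round trip back to the bytes value the vendor expects is just reading the file in binary mode, e.g. ('filename' is the placeholder from the answer):
import pandas as pd

foo = pd.DataFrame({"a": [1, 2]})      # stand-in for your DataFrame
foo.to_csv('filename')                 # write the CSV to disk
data = open('filename', 'rb').read()   # matches the vendor's expected format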