I'm trying to create a column that changes its value for every 1/True in a target column and keeps the previous value for 0/False. For example, how do I get from this
a = pl.DataFrame({'a': [1, 0, 0, 0, 1, 0, 0, 1, 1]})
print(a)
┌─────┐
│ a   │
│ --- │
│ i64 │
╞═════╡
│ 1   │
├╌╌╌╌╌┤
│ 0   │
├╌╌╌╌╌┤
│ 0   │
├╌╌╌╌╌┤
│ 0   │
├╌╌╌╌╌┤
│ 1   │
├╌╌╌╌╌┤
│ 0   │
├╌╌╌╌╌┤
│ 0   │
├╌╌╌╌╌┤
│ 1   │
├╌╌╌╌╌┤
│ 1   │
└─────┘
to this dataframe:
┌─────┬────────────┐
│ a   ┆ b          │
│ --- ┆ ---        │
│ i64 ┆ str        │
╞═════╪════════════╡
│ 1   ┆ new_value1 │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 0   ┆ new_value1 │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 0   ┆ new_value1 │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 0   ┆ new_value1 │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 1   ┆ new_value2 │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 0   ┆ new_value2 │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 0   ┆ new_value2 │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 1   ┆ new_value3 │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 1   ┆ new_value4 │
└─────┴────────────┘
In polars, fold, reduce, cumfold and cumreduce are horizontal expressions, meaning that they operate on columns, not on elements.
To achieve what you want, you can use cumsum to get an integer that increases by one at every 1/True value. Then we combine that result with the format expression to get the string output you want:
a.with_column(
    pl.format("new_value_{}", pl.col("a").cumsum())
)
shape: (9, 2)
┌─────┬─────────────┐
│ a   ┆ literal     │
│ --- ┆ ---         │
│ i64 ┆ str         │
╞═════╪═════════════╡
│ 1   ┆ new_value_1 │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 0   ┆ new_value_1 │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 0   ┆ new_value_1 │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 0   ┆ new_value_1 │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 1   ┆ new_value_2 │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 0   ┆ new_value_2 │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 0   ┆ new_value_2 │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 1   ┆ new_value_3 │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 1   ┆ new_value_4 │
└─────┴─────────────┘
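If you want the column named b and the values without the underscore, exactly as in your desired output, you can drop the underscore from the format string and alias the expression (a small variation on the same idea):
a.with_column(
    pl.format("new_value{}", pl.col("a").cumsum()).alias("b")
)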
If I have two tables A and B, how can I do a join of B into A in place so that A keeps all its data and is modified by that join without having to make a copy?
And can that join only take specified columns from B into A?
A:
┌─────┬─────┬───────┐
│ one ┆ two ┆ three │
╞═════╪═════╪═══════╡
│ a ┆ 1 ┆ 3 │
│ b ┆ 4 ┆ 6 │
│ c ┆ 7 ┆ 9 │
│ d ┆ 10 ┆ 12 │
│ e ┆ 13 ┆ 15 │
│ f ┆ 16 ┆ 18 │
└─────┴─────┴───────┘
B:
┌─────┬─────┬───────┬──────┐
│ one ┆ two ┆ three ┆ four │
╞═════╪═════╪═══════╪══════╡
│ a ┆ 1 ┆ 3 ┆ yes │
│ c ┆ 7 ┆ 9 ┆ yes │
│ f ┆ 16 ┆ 18 ┆ yes │
└─────┴─────┴───────┴──────┘
I'd like to left join A and B, keeping all data in A and only the four column of B, renamed as result.
With data.table I can do exactly this after reading A and B:
address(A)
# [1] "0x55fc74197910"
A[B, on = .(one, two), result := i.four]
A
# one two three result
# 1: a 1 3 yes
# 2: b 4 6 <NA>
# 3: c 7 9 yes
# 4: d 10 12 <NA>
# 5: e 13 15 <NA>
# 6: f 16 18 yes
address(A)
# [1] "0x55fc74197910"
With polars in python:
A.join(B, on = ["one", "two"], how = 'left')
# shape: (6, 5)
# ┌─────┬─────┬───────┬─────────────┬──────┐
# │ one ┆ two ┆ three ┆ three_right ┆ four │
# │ --- ┆ --- ┆ --- ┆ --- ┆ --- │
# │ str ┆ i64 ┆ i64 ┆ i64 ┆ str │
# ╞═════╪═════╪═══════╪═════════════╪══════╡
# │ a ┆ 1 ┆ 3 ┆ 3 ┆ yes │
# │ b ┆ 4 ┆ 6 ┆ null ┆ null │
# │ c ┆ 7 ┆ 9 ┆ 9 ┆ yes │
# │ d ┆ 10 ┆ 12 ┆ null ┆ null │
# │ e ┆ 13 ┆ 15 ┆ null ┆ null │
# │ f ┆ 16 ┆ 18 ┆ 18 ┆ yes │
# └─────┴─────┴───────┴─────────────┴──────┘
A
# shape: (6, 3)
# ┌─────┬─────┬───────┐
# │ one ┆ two ┆ three │
# │ --- ┆ --- ┆ --- │
# │ str ┆ i64 ┆ i64 │
# ╞═════╪═════╪═══════╡
# │ a ┆ 1 ┆ 3 │
# │ b ┆ 4 ┆ 6 │
# │ c ┆ 7 ┆ 9 │
# │ d ┆ 10 ┆ 12 │
# │ e ┆ 13 ┆ 15 │
# │ f ┆ 16 ┆ 18 │
# └─────┴─────┴───────┘
A is unchanged. If A is assigned again:
id(A)
# 139703375023552
A = A.join(B, on=["one", "two"], how="left")
id(A)
# 139703374967280
its memory address changes.
There is indeed no copy occurring there; if you think of the DataFrame class as a container (like a python list), you can see the same sort of thing happening here - the container id changes, but the contents of the container are not copied:
# create a list/container with some object data
v1 = [object(), object(), object()]
print(v1)
# [<object at 0x1686b6510>, <object at 0x1686b6490>, <object at 0x1686b6550>]
v2 = v1[:]
print(v2)
# [<object at 0x1686b6510>, <object at 0x1686b6490>, <object at 0x1686b6550>]
v3 = v1[:2]
print(v3)
# [<object at 0x1686b6510>, <object at 0x1686b6490>]
(Each of v1, v2, and v3 will have different ids).
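Polars has no in-place join, but you can get the same result as the data.table snippet (all of A, plus only the four column of B renamed to result) by selecting and aliasing before joining. A minimal sketch; note that A must be rebound, since polars returns a new DataFrame:
import polars as pl

# a sketch: take only the join keys and the renamed "four" column from B,
# then left-join; rebind A, as polars does not modify DataFrames in place
A = A.join(
    B.select(["one", "two", pl.col("four").alias("result")]),
    on=["one", "two"],
    how="left",
)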
I want to update a polars DataFrame. The polars syntax/command I used for this purpose:
df[0, 'A'] = 'some value'
but the above code gives an error:
ValueError: cannot set with list/tuple as value; use a scalar value
I am using polars 0.13.55. The above code was previously working in polars 0.13.51.
Minimal code to reproduce the problem:
df = pl.DataFrame({"IP": ['1.1.1.1', '2.2.2.2'],
                   "ISP": ["N/A", "N/A"]})
isp_names = {'1.1.1.1': 'ABC', '2.2.2.2': 'XYZ'}
i = 0
for row in df.rows():
    for ip, isp in isp_names.items():
        if row[0] == ip:
            df[i, 'ISP'] = isp  # this line gives the ValueError
    i = i + 1
It looks as though you might be trying to update the values of a DataFrame, particularly where values are missing (the "N/A" values).
In addition to the advice of @jvz, I would recommend using a left join for your purposes, rather than a dictionary and a for loop. For loops are very slow and should be avoided; by contrast, a left join will be very performant, and is built for exactly these types of situations.
We'll take this in steps.
First, let's expand your example.
df = pl.DataFrame(
    {"IP": ["1.1.1.1", "2.2.2.2", "3.3.3.3", "4.4.4.4"],
     "ISP": ["N/A", "N/A", "PQR", "N/A"]}
)
df
shape: (4, 2)
┌─────────┬─────┐
│ IP ┆ ISP │
│ --- ┆ --- │
│ str ┆ str │
╞═════════╪═════╡
│ 1.1.1.1 ┆ N/A │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┤
│ 2.2.2.2 ┆ N/A │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┤
│ 3.3.3.3 ┆ PQR │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┤
│ 4.4.4.4 ┆ N/A │
└─────────┴─────┘
Notice that we have three rows with "N/A" values, but one row that already has a valid value, "PQR".
Next, let's convert your dictionary of updated ISP values to a DataFrame, so that we can join the two DataFrames.
isp_df = pl.DataFrame(
    data=[[key, value] for key, value in isp_names.items()],
    columns=["IP", "ISP_updated"],
    orient="row",
)
isp_df
shape: (2, 2)
┌─────────┬─────────────┐
│ IP ┆ ISP_updated │
│ --- ┆ --- │
│ str ┆ str │
╞═════════╪═════════════╡
│ 1.1.1.1 ┆ ABC │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2.2.2.2 ┆ XYZ │
└─────────┴─────────────┘
Now, we simply join the two DataFrames. The how="left" ensures that we keep all rows from df, even if there are no corresponding rows in isp_df.
df.join(isp_df, on="IP", how="left")
shape: (4, 3)
┌─────────┬─────┬─────────────┐
│ IP ┆ ISP ┆ ISP_updated │
│ --- ┆ --- ┆ --- │
│ str ┆ str ┆ str │
╞═════════╪═════╪═════════════╡
│ 1.1.1.1 ┆ N/A ┆ ABC │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2.2.2.2 ┆ N/A ┆ XYZ │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 3.3.3.3 ┆ PQR ┆ null │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 4.4.4.4 ┆ N/A ┆ null │
└─────────┴─────┴─────────────┘
Notice the null values in ISP_updated. These are cases where you had no updated values for a particular IP value.
To complete the process, we use fill_null to copy the values from the ISP column into the ISP_updated column for those cases where isp_df had no updates for a particular IP value.
(
    df
    .join(isp_df, on="IP", how="left")
    .with_column(
        pl.col("ISP_updated").fill_null(pl.col("ISP"))
    )
)
shape: (4, 3)
┌─────────┬─────┬─────────────┐
│ IP ┆ ISP ┆ ISP_updated │
│ --- ┆ --- ┆ --- │
│ str ┆ str ┆ str │
╞═════════╪═════╪═════════════╡
│ 1.1.1.1 ┆ N/A ┆ ABC │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2.2.2.2 ┆ N/A ┆ XYZ │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 3.3.3.3 ┆ PQR ┆ PQR │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 4.4.4.4 ┆ N/A ┆ N/A │
└─────────┴─────┴─────────────┘
Now, your ISP_updated column has the updated values for each ISP. If you want, you can drop and rename columns so that your final column is labeled ISP.
(
    df
    .join(isp_df, on="IP", how="left")
    .with_column(
        pl.col("ISP_updated").fill_null(pl.col("ISP"))
    )
    .drop("ISP")
    .rename({"ISP_updated": "ISP"})
)
shape: (4, 2)
┌─────────┬─────┐
│ IP ┆ ISP │
│ --- ┆ --- │
│ str ┆ str │
╞═════════╪═════╡
│ 1.1.1.1 ┆ ABC │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┤
│ 2.2.2.2 ┆ XYZ │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┤
│ 3.3.3.3 ┆ PQR │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┤
│ 4.4.4.4 ┆ N/A │
└─────────┴─────┘
As the size of your DataFrames gets large, you will definitely want to avoid using for loops. Using join will be far faster.
I am unable to reproduce the error on the latest version (0.13.56), so updating polars may help.
May I also suggest two improvements to the code, where the second one avoids the issue you ran into altogether?
First, a more Pythonic version:
df = pl.DataFrame({"IP": ['1.1.1.1', '2.2.2.2'],
                   "ISP": ["N/A", "N/A"]})
isp_names = {'1.1.1.1': 'ABC', '2.2.2.2': 'XYZ'}
for i, row in enumerate(df.rows()):
    df[i, 'ISP'] = isp_names[row[0]]
I.e., use enumerate to keep your i aligned with row, and do not loop over isp_names separately; simply look the value up by its key.
Second, Polars has an excellent expression system, meaning you do not have to pre-allocate the ISP column or write a loop:
df = pl.DataFrame({"IP": ['1.1.1.1', '2.2.2.2']})
isp_names = {'1.1.1.1': 'ABC', '2.2.2.2': 'XYZ'}
df.with_column(pl.col("IP").apply(isp_names.get).alias("ISP"))
which returns:
shape: (2, 2)
┌─────────┬─────┐
│ IP ┆ ISP │
│ --- ┆ --- │
│ str ┆ str │
╞═════════╪═════╡
│ 1.1.1.1 ┆ ABC │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┤
│ 2.2.2.2 ┆ XYZ │
└─────────┴─────┘
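Note that dict.get returns None for keys that are absent from the mapping, which polars stores as null. If you would rather keep a default value for unknown IPs, you can pass a fallback to get; a minimal sketch, assuming "N/A" is the default you want:
df.with_column(
    pl.col("IP").apply(lambda ip: isp_names.get(ip, "N/A")).alias("ISP")
)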
Based on the polars documentation, one can use json_path_match to extract JSON fields into a string series.
But can we do something like pandas.Series.map(json.loads) to convert a whole JSON string column at once? One could then further convert the loaded JSON series into another DataFrame with sane dtypes.
I know I can do it first in pandas, but I'm looking for a way to do it in polars.
I should first point out that there is a polars.read_json method. For example:
import polars as pl
import io
json_file = """[{"a":"1", "b":10, "c":[1,2,3]},
{"a":"2", "b":20, "c":[3,4,5]},
{"a":"3.1", "b":30.2, "c":[8,8,8]},
{"a":"4", "b":40.0, "c":[9,9,90]}]
"""
pl.read_json(io.StringIO(json_file))
shape: (4, 3)
┌─────┬──────┬────────────┐
│ a ┆ b ┆ c │
│ --- ┆ --- ┆ --- │
│ str ┆ f64 ┆ list [i64] │
╞═════╪══════╪════════════╡
│ 1 ┆ 10.0 ┆ [1, 2, 3] │
├╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2 ┆ 20.0 ┆ [3, 4, 5] │
├╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 3.1 ┆ 30.2 ┆ [8, 8, 8] │
├╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 4 ┆ 40.0 ┆ [9, 9, 90] │
└─────┴──────┴────────────┘
But to answer your specific question about JSON data already loaded into a Series, I think what you're looking for is the polars.Series.apply method, which will apply a callable function to each cell of a Polars Series.
For example, let's say we have the following JSON fields already loaded into a Series in a Polars DataFrame:
import json
import polars as pl
df = pl.DataFrame(
{
"json_val": [
'{"a":"1", "b":10, "c":[1,2,3]}',
'{"a":"2", "b":20, "c":[3,4,5]}',
'{"a":"3.1", "b":30.2, "c":[8,8,8]}',
'{"a":"4", "b":40.0, "c":[9,9,90]}',
]
}
)
print(df)
shape: (4, 1)
┌─────────────────────────────────────┐
│ json_val │
│ --- │
│ str │
╞═════════════════════════════════════╡
│ {"a":"1", "b":10, "c":[1,2,3]} │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ {"a":"2", "b":20, "c":[3,4,5]} │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ {"a":"3.1", "b":30.2, "c":[8,8,8... │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ {"a":"4", "b":40.0, "c":[9,9,90]... │
└─────────────────────────────────────┘
We can use apply and the json.loads function. In this example, that will yield a Series of type struct:
df.select(pl.col("json_val").apply(json.loads))
shape: (4, 1)
┌──────────────────────────┐
│ json_val │
│ --- │
│ struct[3]{'a', 'b', 'c'} │
╞══════════════════════════╡
│ {"1",10,[1, 2, 3]} │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ {"2",20,[3, 4, 5]} │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ {"3.1",30,[8, 8, 8]} │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ {"4",40,[9, 9, 90]} │
└──────────────────────────┘
(One caution: notice how column b has been truncated to an integer.)
Depending on the structure of your JSON, you may be able to also use the polars.DataFrame.unnest function to split the json_val struct column into separate columns.
df.select(pl.col("json_val").apply(json.loads)).unnest("json_val")
shape: (4, 3)
┌─────┬─────┬────────────┐
│ a ┆ b ┆ c │
│ --- ┆ --- ┆ --- │
│ str ┆ i64 ┆ list [i64] │
╞═════╪═════╪════════════╡
│ 1 ┆ 10 ┆ [1, 2, 3] │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2 ┆ 20 ┆ [3, 4, 5] │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 3.1 ┆ 30 ┆ [8, 8, 8] │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 4 ┆ 40 ┆ [9, 9, 90] │
└─────┴─────┴────────────┘
Does this help get you started?
Edit: handling type-conversion issues
One general strategy that I use with any un-typed input file (especially csv files) is to return all values as a string/polars.Utf8 type. That way, I can explicitly convert types later, after I've had a chance to visually inspect the results. (I've been burned too often by "automatic" type conversions.)
The json.loads function has two helpful keyword options, parse_float and parse_int, that help in this case. We can use a simple lambda to tell the JSON parser to leave integer and float values as strings.
# define our own translate function to keep floats/ints as strings
def json_translate(json_str: str):
    return json.loads(json_str, parse_float=lambda x: x, parse_int=lambda x: x)
df.select(pl.col("json_val").apply(f=json_translate))
shape: (4, 1)
┌────────────────────────────────┐
│ json_val │
│ --- │
│ struct[3]{'a', 'b', 'c'} │
╞════════════════════════════════╡
│ {"1","10",["1", "2", "3"]} │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ {"2","20",["3", "4", "5"]} │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ {"3.1","30.2",["8", "8", "8"]} │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ {"4","40.0",["9", "9", "90"]} │
└────────────────────────────────┘
Notice that all the integer and float values are left as strings, and remain so when we use the unnest function (the column headers below show "str").
df.select(pl.col("json_val").apply(f=json_translate)).unnest('json_val')
shape: (4, 3)
┌─────┬──────┬──────────────────┐
│ a ┆ b ┆ c │
│ --- ┆ --- ┆ --- │
│ str ┆ str ┆ list [str] │
╞═════╪══════╪══════════════════╡
│ 1 ┆ 10 ┆ ["1", "2", "3"] │
├╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2 ┆ 20 ┆ ["3", "4", "5"] │
├╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 3.1 ┆ 30.2 ┆ ["8", "8", "8"] │
├╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 4 ┆ 40.0 ┆ ["9", "9", "90"] │
└─────┴──────┴──────────────────┘
From this point, you can use Polars' cast expression to convert the strings to the specific numeric types that you want.
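For example, a minimal sketch of the explicit cast, assuming a and b should both end up as Float64:
(
    df
    .select(pl.col("json_val").apply(f=json_translate))
    .unnest("json_val")
    .with_columns([pl.col("a").cast(pl.Float64), pl.col("b").cast(pl.Float64)])
)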
I am looking for the opposite of the dropmissing function in DataFrames.jl so that the user knows where to look to fix their bad data. It seems like this should be easy, but the filter function expects a column to be specified and I cannot get it to iterate over all columns.
julia> df=DataFrame(a=[1, missing, 3], b=[4, 5, missing])
3×2 DataFrame
│ Row │ a │ b │
│ │ Int64? │ Int64? │
├─────┼─────────┼─────────┤
│ 1 │ 1 │ 4 │
│ 2 │ missing │ 5 │
│ 3 │ 3 │ missing │
julia> filter(x -> ismissing(eachcol(x)), df)
ERROR: MethodError: no method matching eachcol(::DataFrameRow{DataFrame,DataFrames.Index})
julia> filter(x -> ismissing.(x), df)
ERROR: ArgumentError: broadcasting over `DataFrameRow`s is reserved
I am basically trying to recreate the disallowmissing function, but with a more useful error message.
Here are two ways to do it:
julia> df = DataFrame(a=[1, missing, 3], b=[4, 5, missing])
3×2 DataFrame
│ Row │ a │ b │
│ │ Int64? │ Int64? │
├─────┼─────────┼─────────┤
│ 1 │ 1 │ 4 │
│ 2 │ missing │ 5 │
│ 3 │ 3 │ missing │
julia> df[.!completecases(df), :] # this will be faster
2×2 DataFrame
│ Row │ a │ b │
│ │ Int64? │ Int64? │
├─────┼─────────┼─────────┤
│ 1 │ missing │ 5 │
│ 2 │ 3 │ missing │
julia> @view df[.!completecases(df), :]
2×2 SubDataFrame
│ Row │ a │ b │
│ │ Int64? │ Int64? │
├─────┼─────────┼─────────┤
│ 1 │ missing │ 5 │
│ 2 │ 3 │ missing │
julia> filter(row -> any(ismissing, row), df)
2×2 DataFrame
│ Row │ a │ b │
│ │ Int64? │ Int64? │
├─────┼─────────┼─────────┤
│ 1 │ missing │ 5 │
│ 2 │ 3 │ missing │
julia> filter(row -> any(ismissing, row), df, view=true) # requires DataFrames.jl 0.22
2×2 SubDataFrame
Row │ a b
│ Int64? Int64?
─────┼──────────────────
1 │ missing 5
2 │ 3 missing
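Since the goal is to recreate disallowmissing with a more useful error message, here is a minimal sketch built on completecases (the function name assert_no_missing is hypothetical):
using DataFrames

# a sketch: like disallowmissing, but report the offending rows first
function assert_no_missing(df::AbstractDataFrame)
    bad = df[.!completecases(df), :]
    nrow(bad) > 0 && error("found $(nrow(bad)) rows with missing values:\n$bad")
    return disallowmissing(df)
end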
I did a small experiment and found that the error occurs because of the different data types of the columns in the CSV. Please see the following code:
julia> using DataFrames
julia> df = DataFrame(:a => [1.0, 2, missing, missing, 5.0], :b => [1.1, 2.2, 3, missing, 5],:c => [1,3,5,missing,6])
5×3 DataFrame
│ Row │ a │ b │ c │
│ │ Float64? │ Float64? │ Int64? │
├─────┼──────────┼──────────┼─────────┤
│ 1 │ 1.0 │ 1.1 │ 1 │
│ 2 │ 2.0 │ 2.2 │ 3 │
│ 3 │ missing │ 3.0 │ 5 │
│ 4 │ missing │ missing │ missing │
│ 5 │ 5.0 │ 5.0 │ 6 │
julia> using Impute
julia> Impute.interp(df)
ERROR: InexactError: Int64(5.5)
Stacktrace:
[1] Int64 at ./float.jl:710 [inlined]
[2] convert at ./number.jl:7 [inlined]
[3] convert at ./missing.jl:69 [inlined]
[4] setindex! at ./array.jl:826 [inlined]
[5] (::Impute.var"#58#59"{Int64,Array{Union{Missing, Int64},1}})(::Impute.Context) at /home/synerzip/.julia/packages/Impute/GmIMg/src/imputors/interp.jl:67
[6] (::Impute.Context)(::Impute.var"#58#59"{Int64,Array{Union{Missing, Int64},1}}) at /home/synerzip/.julia/packages/Impute/GmIMg/src/context.jl:227
[7] _impute!(::Array{Union{Missing, Int64},1}, ::Impute.Interpolate) at /home/synerzip/.julia/packages/Impute/GmIMg/src/imputors/interp.jl:49
[8] impute!(::Array{Union{Missing, Int64},1}, ::Impute.Interpolate) at /home/synerzip/.julia/packages/Impute/GmIMg/src/imputors.jl:84
[9] impute!(::DataFrame, ::Impute.Interpolate) at /home/synerzip/.julia/packages/Impute/GmIMg/src/imputors.jl:172
[10] #impute#17 at /home/synerzip/.julia/packages/Impute/GmIMg/src/imputors.jl:76 [inlined]
[11] impute at /home/synerzip/.julia/packages/Impute/GmIMg/src/imputors.jl:76 [inlined]
[12] _impute(::DataFrame, ::Type{Impute.Interpolate}) at /home/synerzip/.julia/packages/Impute/GmIMg/src/imputors.jl:58
[13] #interp#105 at /home/synerzip/.julia/packages/Impute/GmIMg/src/Impute.jl:84 [inlined]
[14] interp(::DataFrame) at /home/synerzip/.julia/packages/Impute/GmIMg/src/Impute.jl:84
[15] top-level scope at REPL[15]:1
and this error does not occur when I run the following code
julia> df = DataFrame(:a => [1.0, 2, missing, missing, 5.0], :b => [1.1, 2.2, 3, missing, 5])
5×2 DataFrame
│ Row │ a │ b │
│ │ Float64? │ Float64? │
├─────┼──────────┼──────────┤
│ 1 │ 1.0 │ 1.1 │
│ 2 │ 2.0 │ 2.2 │
│ 3 │ missing │ 3.0 │
│ 4 │ missing │ missing │
│ 5 │ 5.0 │ 5.0 │
julia> Impute.interp(df)
5×2 DataFrame
│ Row │ a │ b │
│ │ Float64? │ Float64? │
├─────┼──────────┼──────────┤
│ 1 │ 1.0 │ 1.1 │
│ 2 │ 2.0 │ 2.2 │
│ 3 │ 3.0 │ 3.0 │
│ 4 │ 4.0 │ 4.0 │
│ 5 │ 5.0 │ 5.0 │
Now I know the reason, but I am confused about how to solve it. I cannot specify eltypes while reading the CSV, because my dataset contains 171 columns, each of which is either Int or Float. I am stuck on how to convert all columns to Float64.
I assume you want:
- something simple, which does not have to be maximally efficient
- all your columns to be numeric (possibly having missing values)
Then just write:
julia> df
5×3 DataFrame
│ Row │ a │ b │ c │
│ │ Float64? │ Float64? │ Int64? │
├─────┼──────────┼──────────┼─────────┤
│ 1 │ 1.5 │ 1.65 │ 1 │
│ 2 │ 3.0 │ 3.3 │ 3 │
│ 3 │ missing │ 4.5 │ 5 │
│ 4 │ missing │ missing │ missing │
│ 5 │ 7.5 │ 7.5 │ 6 │
julia> float.(df)
5×3 DataFrame
│ Row │ a │ b │ c │
│ │ Float64? │ Float64? │ Float64? │
├─────┼──────────┼──────────┼──────────┤
│ 1 │ 1.5 │ 1.65 │ 1.0 │
│ 2 │ 3.0 │ 3.3 │ 3.0 │
│ 3 │ missing │ 4.5 │ 5.0 │
│ 4 │ missing │ missing │ missing │
│ 5 │ 7.5 │ 7.5 │ 6.0 │
It is possible to be more efficient (i.e. convert only the columns that are integers in the source data frame), but that requires more code; please comment if you need such a solution.
EDIT
Also note that CSV.jl has a typemap keyword argument that should allow you to handle this issue when reading the data in.
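A minimal sketch of that approach, assuming the data lives in a hypothetical file data.csv:
using CSV, DataFrames

# a sketch: promote every Int64 column to Float64 while parsing
df = CSV.read("data.csv", DataFrame; typemap=Dict(Int64 => Float64))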