Multiple multi-value columns in a pandas dataset - want to make multiple rows [duplicate]

Duplicate of: Split (explode) pandas dataframe string entry to separate rows
I have the following dataset from Twitter in a pandas DataFrame.
app_clicks billed_charge_local_micro billed_engagements card_engagements ... retweets tweets_send unfollows url_clicks
0 None [422040000, 422040000, 422040000] [59, 65, 63] None ... [0, 2, 0] None [0, 0, 1] [65, 68, 67]
I want to turn that into three rows, but I'm not sure of the best way to do it. I looked around and saw things like melt, merge, and stack, but nothing that really looks like it fits my case.
I want it to look like this (I don't care about the index; it's just for visual purposes):
Index billed_charge_local_micro
0 422040000
1 422040000
2 422040000
Thanks.

You can do this with a couple of DataFrame operations:
import pandas as pd

df2 = pd.DataFrame({'billed_charge_local_micro': [[422040000, 422040000, 422040000]],
                    'other1': 10000,
                    'other2': 'abc'})
print(df2)
#            billed_charge_local_micro  other1 other2
# 0  [422040000, 422040000, 422040000]   10000    abc

# Expand the list into one column per element, then transpose so each
# element becomes its own row.
df = df2['billed_charge_local_micro'].apply(pd.Series)
df = df.transpose()
df.columns = ["billed_charge_local_micro"]
print(df)
Final result:
billed_charge_local_micro
0 422040000
1 422040000
2 422040000
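As a side note, since the original frame has several list columns of equal length per row, a more direct route exists in newer pandas. This is a hedged sketch: it assumes pandas >= 1.3, where DataFrame.explode accepts a list of columns, and the column names are taken from the question:
import pandas as pd

# Toy frame mimicking the Twitter export: list columns of equal length per row.
df = pd.DataFrame({
    "billed_charge_local_micro": [[422040000, 422040000, 422040000]],
    "billed_engagements": [[59, 65, 63]],
    "retweets": [[0, 2, 0]],
})

# Explode all list columns at once; each original row becomes three rows.
out = df.explode(["billed_charge_local_micro", "billed_engagements", "retweets"],
                 ignore_index=True)
print(out)
#   billed_charge_local_micro billed_engagements retweets
# 0                 422040000                 59        0
# 1                 422040000                 65        2
# 2                 422040000                 63        0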


cannot transform values in pandas dataframe using a mask [duplicate]

Duplicate of: How to deal with SettingWithCopyWarning in Pandas
Here is an example to illustrate. I am doing something as follows:
import numpy as np
import pandas as pd
data = {'col_1': [3, 5, -1, 0], 'col_2': ['a', 'b', 'c', 'd']}
x = pd.DataFrame.from_dict(data)
mask = x['col_1'].values > 0
x[mask]['col_1'] = np.log(x[mask]['col_1'])
This comes back with:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
Also, the dataframe remains unchanged.
Use DataFrame.loc to select and set the column with a condition:
mask = x['col_1'].values > 0
x.loc[mask, 'col_1'] = np.log(x.loc[mask, 'col_1'])
print (x)
col_1 col_2
0 1.098612 a
1 1.609438 b
2 -1.000000 c
3 0.000000 d
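For context, the reason the original snippet leaves x unchanged is chained indexing: x[mask] builds a new copy, so the assignment lands on that copy rather than on x. A small sketch of that behaviour, reusing the example frame above (nothing new is assumed beyond the code already shown):
import numpy as np
import pandas as pd

x = pd.DataFrame({'col_1': [3, 5, -1, 0], 'col_2': ['a', 'b', 'c', 'd']})
mask = x['col_1'].values > 0

subset = x[mask]                            # boolean indexing returns a copy
subset['col_1'] = np.log(subset['col_1'])   # may warn; only the copy changes
print(x)                                    # x still holds 3, 5, -1, 0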

drop rows from a Pandas dataframe based on which rows have missing values in another dataframe

I'm trying to drop rows with missing values in any of several dataframes.
They all have the same number of rows, so I tried this:
model_data_with_NA = pd.concat([other_df,
                                standardized_numerical_data,
                                encode_categorical_data], axis=1)
ok_rows = ~(model_data_with_NA.isna().all(axis=1))
model_data = model_data_with_NA.dropna()
assert(sum(ok_rows) == len(model_data))
The assertion fails!
As a Python newbie, I wonder why this doesn't work. Also, would it be better to use hierarchical indexing? Then I could extract the original columns from model_data.
In Short
I believe the all in ~(model_data_with_NA.isna().all(axis=1)) should be replaced with any.
The reason is that all checks whether every value in a row is missing, while any checks whether at least one value is missing. Since dropna() defaults to how='any' and drops rows with at least one missing value, ok_rows must also be computed with any for the counts to match.
Full Example
import pandas as pd
import numpy as np
df1 = pd.DataFrame({'a':[1, 2, 3]})
df2 = pd.DataFrame({'b':[1, np.nan]})
df3 = pd.DataFrame({'c': [1, 2, np.nan]})
model_data_with_na = pd.concat([df1, df2, df3], axis=1)
ok_rows = ~(model_data_with_na.isna().any(axis=1))
model_data = model_data_with_na.dropna()
assert(sum(ok_rows) == len(model_data))
model_data_with_na
   a    b    c
0  1  1.0  1.0
1  2  NaN  2.0
2  3  NaN  NaN

model_data
   a    b    c
0  1  1.0  1.0
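On the hierarchical-indexing question: one possible approach (a sketch, assuming the keys argument of pd.concat, which labels each source frame at the top column level) is to build a column MultiIndex so the original blocks can still be pulled out after dropping incomplete rows:
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'a': [1, 2, 3]})
df2 = pd.DataFrame({'b': [1, np.nan]})
df3 = pd.DataFrame({'c': [1, 2, np.nan]})

# keys= creates a top column level naming each source frame.
model_data_with_na = pd.concat([df1, df2, df3], axis=1, keys=['df1', 'df2', 'df3'])
model_data = model_data_with_na.dropna()

# The original columns of any block are recoverable via the top-level key.
print(model_data['df2'])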

Easiest way to ignore or drop one header row from the first page, when parsing a table spanning several pages

I am parsing a PDF with tabula-py, and I need to ignore the first two tables, but then parse the rest of the tables as one and export to a CSV. In the first relevant table (index 2), the first row is a header row, and I want to leave it out of the CSV.
See my code below, including my attempt at dropping the relevant row from the Pandas frame.
What is the easiest/most elegant way of achieving this?
tables = tabula.read_pdf('input.pdf', pages='all', multiple_tables=True)
f = open('output.csv', 'w')
# tables[2].drop(index=0)  # tried this, but makes no difference
for df in tables[2:]:
    df.to_csv(f, index=False, sep=';')
f.close()
Given the following toy dataframes:
import pandas as pd

tables = [
    pd.DataFrame([[1, 3], [2, 4]]),
    pd.DataFrame([["a", "b"], [1, 3], [2, 4]]),
]
for table in tables:
    print(table)

# Output
   0  1
0  1  3
1  2  4
   0  1
0  a  b   <<< unwanted row in tables[1]
1  1  3
2  2  4
You can drop the first row of the second dataframe either by reassigning the resulting dataframe (the preferable way):
tables[1] = tables[1].drop(index=0)
Or inplace:
tables[1].drop(index=0, inplace=True)
And so, in both cases:
print(tables[1])
# Output
0 1
1 1 3
2 2 4
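Putting this back into the original tabula-py loop, here is a hedged sketch (it keeps the question's read_pdf call and file names, which are placeholders):
import tabula

tables = tabula.read_pdf('input.pdf', pages='all', multiple_tables=True)

# drop() is not in-place by default, so reassign the result.
tables[2] = tables[2].drop(index=0)

with open('output.csv', 'w') as f:
    for df in tables[2:]:
        df.to_csv(f, index=False, sep=';')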

Assert an integer is in list on pandas series

I have a DataFrame with two pandas Series as follows:
value accepted_values
0 1 [1, 2, 3, 4]
1 2 [5, 6, 7, 8]
I would like to efficiently check if the value is in accepted_values using pandas methods.
I already know I can do something like the following, but I'm interested in a faster approach if there is one (this took around 27 seconds on a 1-million-row DataFrame):
import pandas as pd

df = pd.DataFrame({"value": [1, 2], "accepted_values": [[1, 2, 3, 4], [5, 6, 7, 8]]})

def check_first_in_second(values: pd.Series):
    return values[0] in values[1]

are_in_accepted_values = df[["value", "accepted_values"]].apply(
    check_first_in_second, axis=1
)
if not are_in_accepted_values.all():
    raise AssertionError("Not all value in accepted_values")
If you create a DataFrame from the list column, you can compare it against value with DataFrame.eq and test whether at least one value per row matches with DataFrame.any:
df1 = pd.DataFrame(df["accepted_values"].tolist(), index=df.index)
# axis=0 aligns the comparison on the row index, so each row of df1 is
# compared against that row's value.
are_in_accepted_values = df1.eq(df["value"], axis=0).any(axis=1).all()
Another idea:
are_in_accepted_values = all(v in a for v, a in df[["value", "accepted_values"]].to_numpy())
I found a little optimisation to your second idea. Using a bit more numpy than pandas makes it faster (more than 3x, tested with time.perf_counter()).
values = df["value"].values
accepted_values = df["accepted_values"].values
are_in_accepted_values = all(s in e for s, e in np.column_stack([values, accepted_values]))
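For reference, a sketch of how such a timing comparison might be set up; the million-row frame and random data are illustrative assumptions, not from the original posts:
import time
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 1_000_000
values = rng.integers(0, 10, n)
# Guarantee membership so all() has to scan every row instead of
# short-circuiting at the first failure.
accepted = [list(rng.integers(0, 10, 3)) + [int(v)] for v in values]
df = pd.DataFrame({"value": values, "accepted_values": accepted})

start = time.perf_counter()
v = df["value"].values
a = df["accepted_values"].values
result = all(s in e for s, e in np.column_stack([v, a]))
print(result, f"{time.perf_counter() - start:.3f}s")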

How to use pandas rename() on multi-index columns?

How can I simply rename a MultiIndex column of a pandas DataFrame, using the rename() function?
Let's look at an example and create such a DataFrame:
import pandas
df = pandas.DataFrame({'A': [1, 1, 1, 2, 2], 'B': range(5), 'C': range(5)})
df = df.groupby("A").agg({"B":["min","max"],"C":"mean"})
print(df)
B C
min max mean
A
1 0 2 1.0
2 3 4 3.5
I am able to select a given MultiIndex column by using a tuple for its name:
print(df[("B","min")])
A
1 0
2 3
Name: (B, min), dtype: int64
However, when using the same tuple naming with the rename() function, it does not seem to be accepted:
df.rename(columns={("B", "min"): "renamed"}, inplace=True)
print(df)
B C
min max mean
A
1 0 2 1.0
2 3 4 3.5
Any idea how rename() should be called to deal with Multi-Index columns?
PS: I am aware of the other options to flatten the column names beforehand, but that prevents one-liners, so I am looking for a cleaner solution (see my previous question).
This doesn't answer the question as worded, but it will work for your given example (assuming you want them all renamed with no MultiIndex):
import pandas as pd
df = pd.DataFrame({'A': [1, 1, 1, 2, 2], 'B': range(5), 'C': range(5)})
df = df.groupby("A").agg(
renamed=('B', 'min'),
B_max=('B', 'max'),
C_mean=('C', 'mean'),
)
print(df)
renamed B_max C_mean
A
1 0 2 1.0
2 3 4 3.5
For more info, you can see the pandas docs and some other related questions.
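For completeness, two hedged ways to rename just the ("B", "min") column while keeping the MultiIndex (these are not from the quoted answer; the first relies on rename()'s level argument, the second rebuilds the MultiIndex by hand):
import pandas as pd

df = pd.DataFrame({'A': [1, 1, 1, 2, 2], 'B': range(5), 'C': range(5)})
df = df.groupby("A").agg({"B": ["min", "max"], "C": "mean"})

# Option 1: rename() maps labels per level, so target the second level.
# Note this renames every "min" label at that level (here there is only one).
df1 = df.rename(columns={"min": "renamed"}, level=1)

# Option 2: rebuild the MultiIndex with the single tuple swapped out.
df2 = df.copy()
df2.columns = pd.MultiIndex.from_tuples(
    [("B", "renamed") if col == ("B", "min") else col for col in df2.columns]
)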