Split html table to smaller Pandas DataFrames - pandas

I'm trying to parse HTML tables from the page ukwtv.de into pandas DataFrames.
The challenge is that one HTML table actually combines two or even three logical tables.
From that one table I want:
TV program name and SID as df1,
Kanal, Standort, etc. as df2,
Technische Details as df3.
Here is what I've managed so far:
import pandas as pd

table_MN = pd.read_html('https://www.ukwtv.de/cms/deutschland-tv/schleswig-holstein-tv.html',
                        thousands='.', decimal=',')
df1 = table_MN[1]
df1.columns = df1.columns.str.replace(" ", "_")
df1.columns = df1.columns.str.replace("\n", "_")
df1 = df1.iloc[:7, :]
for col in df1.columns:
    print(col)
    if '.' in col:
        df1.drop(col, axis=1, inplace=True)
df1.dropna(subset=["TV-_und_Radio-Programme_des_Bouquets"], axis=0, inplace=True)
df1.head(15)

df2 = table_MN[1]
df2.columns = df2.iloc[7]
df2 = df2.iloc[8:, :]
df2 = df2.reset_index(drop=True)
df2.head(20)
Two issues I haven't managed to solve:
Row 7 is hardcoded; how do I recognize the blank line so the data can be split into two DataFrames automatically?
The Technische Details column in df1 needs to be converted into a separate DataFrame where Modulation, Guardintervall, ... are the Series names.
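No answer is shown for the blank-line question, but one common approach is to look for rows that come through pd.read_html as entirely NaN. This is a sketch with an invented stand-in for table_MN[1]; it assumes the separator really is an all-NaN row (on ukwtv.de the split could also be a repeated header row, which would need a different test):

```python
import pandas as pd
import numpy as np

# Hypothetical stand-in for table_MN[1]: two logical tables separated by a blank row
raw = pd.DataFrame({
    'A': ['prog1', 'prog2', np.nan, 'K1', 'K2'],
    'B': ['1', '2', np.nan, 'S1', 'S2'],
})

# A "blank line" in a scraped table usually surfaces as a row of all NaN
blank = raw.isna().all(axis=1)
split_at = blank.idxmax()  # index label of the first all-NaN row

df1 = raw.iloc[:split_at].reset_index(drop=True)
df2 = raw.iloc[split_at + 1:].reset_index(drop=True)
```

With this, the hardcoded 7 is replaced by whatever row the mask finds first.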

Related

Override non null values from one dataframe to another

I would like to override non-null values from one dataframe onto another, matching on the combination of the first row and column (both being unique).
Basically, I am trying to join df2 onto df1 only for the non-null values in df2, keeping df1's rows/columns intact.
eg:
df1 =
df2 =
output =
This should work:
output = df1.merge(df2, on='ID')
cols = [c for c in df1.columns if c != 'ID']
for col in cols:
    output[col] = output[f'{col}_x'].fillna(output[f'{col}_y'])
    output.drop(columns=[f'{col}_x', f'{col}_y'], inplace=True)
Explanation:
First, we merge the two dataframes using ID as the key. The merge joins the dataframes, and where columns share a name it adds the suffixes _x and _y.
Then we iterate over all the columns of df1 (except ID), fill the NA values in col_x with the values from col_y, and put the result into a new column col.
Finally, we drop the auxiliary columns col_x and col_y.
Edit:
Still, even with the updated requirements the approach is similar. In this case, however, you need to perform a left outer join and fill the NA values from the second dataframe. Here is the code:
output = df1.merge(df2, on='ID', how='left')
cols = [c for c in df1.columns if c != 'ID']
for col in cols:
    output[col] = output[f'{col}_y'].fillna(output[f'{col}_x'])
    output.drop(columns=[f'{col}_x', f'{col}_y'], inplace=True)
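A runnable version of the left-join approach, with small invented frames since the question's actual data was not shown:

```python
import pandas as pd

# Invented sample data; the question's frames were not shown
df1 = pd.DataFrame({'ID': [1, 2, 3], 'val': [10.0, 20.0, 30.0]})
df2 = pd.DataFrame({'ID': [2, 3], 'val': [99.0, None]})

output = df1.merge(df2, on='ID', how='left')
for col in [c for c in df1.columns if c != 'ID']:
    output[col] = output[f'{col}_y'].fillna(output[f'{col}_x'])
    output.drop(columns=[f'{col}_x', f'{col}_y'], inplace=True)
# ID 1 keeps 10.0 (no match in df2), ID 2 takes 99.0, ID 3 keeps 30.0 (df2's value is null)
```

If both frames can be indexed by ID, `df2.set_index('ID').combine_first(df1.set_index('ID'))` expresses the same "non-null values win" overriding in a single call.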

How to search a data frame and remove items that match another data frame

I have two dataframes:
df1 = names: Tom, Nick, Pat, Frank
df2 = names: Tom, Nick
I would like to make a df3 by having df2 search through df1 and remove matches so I am left with a new dataframe:
df3 = names: Pat, Frank
You can do:
df3 = df1[~df1['names'].isin(df2['names'])]
This checks each name in df1 to see whether it appears in df2, inverts the resulting boolean mask with ~, and filters df1 down to the rows where the mask is True.
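With the data from the question, the whole thing can be sketched end to end:

```python
import pandas as pd

df1 = pd.DataFrame({'names': ['Tom', 'Nick', 'Pat', 'Frank']})
df2 = pd.DataFrame({'names': ['Tom', 'Nick']})

# isin builds a boolean mask; ~ inverts it, so we keep names absent from df2
df3 = df1[~df1['names'].isin(df2['names'])].reset_index(drop=True)
# df3['names'] is now Pat, Frank
```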

Combine a list of pandas dataframes that do not have the same columns to one pandas dataframe

I have three DataFrames, df1, df2, and df3, with the same number of rows but different numbers of columns and different column labels. I want to merge them into one single dataframe in the order df1, df2, df3, keeping the original column labels.
I've read in Combine a list of pandas dataframes to one pandas dataframe that this can be done by:
df = pd.DataFrame.from_dict(map(dict,df_list))
But I cannot fully understand the code. I assume df_list is:
df_list = [df1,df2,df3]
But what is dict? A dictionary of df_list? How do I get it?
I solved this by:
df = pd.concat([df1, df2], axis=1, sort=False)
df = pd.concat([df, df3], axis=1, sort=False)
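The same column-wise concatenation works in one call, since pd.concat accepts the whole list; shown here with small invented frames:

```python
import pandas as pd

# Invented frames: same row count, different column labels
df1 = pd.DataFrame({'a': [1, 2]})
df2 = pd.DataFrame({'b': [3, 4], 'c': [5, 6]})
df3 = pd.DataFrame({'d': [7, 8]})

# axis=1 glues the frames side by side and keeps the original labels
df = pd.concat([df1, df2, df3], axis=1)
# df has columns a, b, c, d and 2 rows
```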

Conditional on pandas DataFrames

Let df1, df2, and df3 be pandas DataFrames having the same structure but different numerical values. I want to perform:
res=if df1>1.0: (df2-df3)/(df1-1) else df3
res should have the same structure as df1, df2, and df3 have.
numpy.where() returns a plain array, losing the index and columns.
Edit 1:
res should have the same indices as df1, df2, and df3 have.
For example, I can access df2 as df2["instanceA"]["parameter1"]["parameter2"]. I want to access the newly calculated DataFrame/Series res as res["instanceA"]["parameter1"]["parameter2"].
Actually numpy.where should work fine there. Output here is 4x2 (same as df1, df2, df3).
df1 = pd.DataFrame( np.random.randn(4,2), columns=list('xy') )
df2 = pd.DataFrame( np.random.randn(4,2), columns=list('xy') )
df3 = pd.DataFrame( np.random.randn(4,2), columns=list('xy') )
res = df3.copy()
res[:] = np.where( df1 > 1, (df2-df3)/(df1-1), df3 )
          x         y
0 -0.671787 -0.445276
1 -0.609351 -0.881987
2  0.324390  1.222632
3 -0.138606  0.955993
Note that this should work on both series and dataframes. The [:] is slicing syntax that preserves the index and columns. Without that res will come out as an array rather than series or dataframe.
Alternatively, for a series you could write, as @Kadir does in his answer:
res = pd.Series(np.where( df1>1, (df2-df3)/(df1-1), df3 ), index=df1.index)
Or similarly for a dataframe you could write:
res = pd.DataFrame(np.where(df1 > 1, (df2 - df3)/(df1 - 1), df3),
                   index=df1.index, columns=df1.columns)
Integrating the idea in this question into JohnE's answer, I have come up with this solution:
res = pd.Series(np.where( df1 > 1, (df2-df3)/(df1-1), df3 ), index=df1.index)
A better answer using DataFrames will be appreciated.
Say df is your initial dataframe and res is the new column. Use a combination of setting values and boolean indexing.
Set res to be a copy of df3:
df['res'] = df['df3']
Then adjust the values where the condition holds. Note that chained indexing such as df[df['df1']>1.0]['res'] = ... assigns to a temporary copy and is not guaranteed to modify df; use .loc instead:
df.loc[df['df1'] > 1.0, 'res'] = (df['df2'] - df['df3'])/(df['df1'] - 1)
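A runnable sketch of this pattern with invented numbers, using .loc so the conditional assignment lands on the original frame rather than a temporary copy:

```python
import pandas as pd

# Invented data: the three original frames flattened into columns of one frame
df = pd.DataFrame({'df1': [0.5, 2.0], 'df2': [4.0, 6.0], 'df3': [1.0, 2.0]})

df['res'] = df['df3']                # default: copy of df3
mask = df['df1'] > 1.0
# .loc with a boolean mask assigns only the selected rows, aligned by index
df.loc[mask, 'res'] = (df['df2'] - df['df3']) / (df['df1'] - 1)
# row 0 keeps 1.0; row 1 becomes (6-2)/(2-1) = 4.0
```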

Assign dataframes in a list to a list of names; pandas

I have a variable:
var = [name1, name2]
I also have a list of dataframes:
df = [df1, df2]
How do I assign df1 to name1, df2 to name2, and so on?
If I understand correctly, assuming both lists have the same length, you just iterate over the indices of both lists and assign them, for example:
In [412]:
name1, name2 = None, None
var = [name1, name2]
df1, df2 = 1, 2
df = [df1, df2]

for x in range(len(var)):
    var[x] = df[x]
var
Out[412]:
[1, 2]
If your variable list stores strings then I would not make variables from those strings (see How do I create a variable number of variables?) but would instead create a dict:
In [414]:
var=['name1','name2']
df1, df2 = 1,2
df= [df1, df2]
d = dict(zip(var,df))
d
Out[414]:
{'name1': 1, 'name2': 2}
To answer the question literally, you can do:
for name, frame in zip(var, df):
    globals()[name] = frame
And then access your variables.
But proceeding this way is bad practice: you are injecting arbitrary names into your global namespace. It's better to keep control over what you handle, and keep your dataframes in a list or a dictionary.