Transform DataFrame rows to columns and additional steps

I have a dataframe like this:
import pandas as pd
import numpy as np

data = {'UNIT': ['UNIT_1', 'UNIT_2', 'UNIT_3', 'UNIT_4'],
        'Name_1': ['werner', 'otto', 'karl', 'fritz'],
        'Name_2': ['ottilie', 'anna', 'jasmin', ''],
        'Name_3': ['bello', 'kitti', '', '']}
df = pd.DataFrame(data)
df.replace('', np.nan, inplace=True)
display(df)
which looks like this:
     UNIT  Name_1   Name_2 Name_3
0  UNIT_1  werner  ottilie  bello
1  UNIT_2    otto     anna  kitti
2  UNIT_3    karl   jasmin    NaN
3  UNIT_4   fritz      NaN    NaN
The result that I want is a single column of names with the matching UNIT next to each name (one row per name).
The code I have so far looks like this:
for index, row in df.iterrows():
    row_transposed = row.T
    row_transposed.dropna(inplace=True)
    df_row_transposed = pd.DataFrame(row_transposed)
    df_row_transposed_head = df_row_transposed.head(1)
    #display(df_row_transposed)
    #display(df_row_transposed_head)
    hr_unit = df_row_transposed_head.iloc[0]
    add_unit = hr_unit[index]
    for index, row in df_row_transposed.iterrows():
        df_row_transposed["UNIT"] = add_unit
    #row_transposed = row_transposed.iloc[index:, :]
    display(df_row_transposed)
which already creates one transposed block per UNIT, but now I am stuck...
Any help is very much appreciated.

Try df.melt. It will un-pivot the Name_* columns into rows.
ddf = df.melt(id_vars='UNIT').sort_values(by='UNIT')
new_df = ddf[["value", "UNIT"]]
new_df = new_df.dropna().reset_index(drop=True)
new_df
Out[163]:
value UNIT
0 werner UNIT_1
1 ottilie UNIT_1
2 bello UNIT_1
3 otto UNIT_2
4 anna UNIT_2
5 kitti UNIT_2
6 karl UNIT_3
7 jasmin UNIT_3
8 fritz UNIT_4
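For reference, the same result can be written as a single chain (just the melt approach above, reordered; nothing new assumed):
new_df = (df.melt(id_vars='UNIT')        # un-pivot the Name_* columns
            .dropna(subset=['value'])    # drop the empty name slots
            .sort_values('UNIT')
            [['value', 'UNIT']]
            .reset_index(drop=True))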

Related

drop rows from a Pandas dataframe based on which rows have missing values in another dataframe

I'm trying to drop rows with missing values in any of several dataframes.
They all have the same number of rows, so I tried this:
model_data_with_NA = pd.concat([other_df,
                                standardized_numerical_data,
                                encode_categorical_data], axis=1)
ok_rows = ~(model_data_with_NA.isna().all(axis=1))
model_data = model_data_with_NA.dropna()
assert(sum(ok_rows) == len(model_data))
The assertion fails!
As a newbie in Python, I wonder why this doesn't work. Also, is it better to use hierarchical indexing? Then I could extract the original columns from model_data.
In Short
I believe the all in ~(model_data_with_NA.isna().all(axis=1)) should be replaced with any.
The reason is that all checks whether every value in a row is missing, while any checks whether at least one value is missing.
Full Example
import pandas as pd
import numpy as np
df1 = pd.DataFrame({'a':[1, 2, 3]})
df2 = pd.DataFrame({'b':[1, np.nan]})
df3 = pd.DataFrame({'c': [1, 2, np.nan]})
model_data_with_na = pd.concat([df1, df2, df3], axis=1)
ok_rows = ~(model_data_with_na.isna().any(axis=1))
model_data = model_data_with_na.dropna()
assert(sum(ok_rows) == len(model_data))
model_data_with_na
   a    b    c
0  1  1.0  1.0
1  2  NaN  2.0
2  3  NaN  NaN
model_data
   a    b    c
0  1  1.0  1.0
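On the second question: yes, passing keys= to pd.concat builds a hierarchical (MultiIndex) column index, which lets you pull the original columns back out later. A minimal sketch using the dummy frames above (the key names here are made up):
# keys= adds an outer column level naming each source frame
model_data_with_na = pd.concat([df1, df2, df3], axis=1,
                               keys=['df1', 'df2', 'df3'])
model_data = model_data_with_na.dropna()
# Recover the columns that came from one of the original frames:
print(model_data['df1'])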

What is the smartest way to read a CSV with multiple types of data mixed in?

1,100
2,200
3,300
...
many rows
...
9934,321
9935,111
2021-01-01, jane doe, 321
2021-01-10, john doe, 211
2021-01-30, jack doe, 911
...
many rows
...
2021-11-30, jick doe, 921
If I encounter a CSV file like the one above, how can I separate it into two DataFrames, without a loop or extra computation?
I see it like this:
import pandas as pd

data = 'file.csv'
df = pd.read_csv(data, names=['a', 'b', 'c'])  # I have to name the columns
df_1 = df[~df['c'].isnull()]  # rows with a 3rd column
df_2 = df[df['c'].isnull()]   # rows with only two columns
A second idea is to first find the index of the row where the data switches from two columns to three.
import pandas as pd
import numpy as np

data = 'stack.csv'
df = pd.read_csv(data, names=['a', 'b', 'c'])
rows = df['c'].index[df['c'].apply(np.isnan)]
df_1 = pd.read_csv(data, names=['a', 'b', 'c'], skiprows=rows[-1] + 1)
df_2 = pd.read_csv(data, names=['a', 'b'], nrows=rows[-1] + 1)
I think you can easily adapt the code if the file changes.
Here is the reason why I named the columns: link

Pandas data frame creation using static data

I have a data set like this: {'IT': [1, 20, 35, 44, 51, ..., 1000]}
I want to convert this into a pandas DataFrame and see output in the format below. How can I achieve this?
Dept Count
IT 1
IT 20
IT 35
IT 44
IT 51
.. .
.. .
.. .
IT 1000
I can write it the way below, but this is not an efficient way for huge data.
data = [['IT',1],['IT',2],['IT',3]]
df = pd.DataFrame(data,columns=['Dept','Count'])
print(df)
No need for a list comprehension since pandas will automatically fill IT in for every row.
import pandas as pd
d = {'IT':[1,20,35,44,51,1000]}
df = pd.DataFrame({'dept': 'IT', 'count': d['IT']})
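The result should look like this (note the lower-case column names used in this snippet):
  dept  count
0   IT      1
1   IT     20
2   IT     35
3   IT     44
4   IT     51
5   IT   1000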
Use a list comprehension to build (key, value) tuples and pass them to the DataFrame constructor:
d = {'IT':[1,20,35,44,51], 'NEW':[1000]}
data = [(k, x) for k, v in d.items() for x in v]
df = pd.DataFrame(data,columns=['Dept','Count'])
print(df)
Dept Count
0 IT 1
1 IT 20
2 IT 35
3 IT 44
4 IT 51
5 NEW 1000
You can use melt:
import pandas as pd
d = {'IT': [10]*100000}
df = pd.DataFrame(d)
df = pd.melt(df, var_name='Dept', value_name='Count')
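Applied to the original data, the same call produces the requested layout (a sketch):
d = {'IT': [1, 20, 35, 44, 51, 1000]}
df = pd.melt(pd.DataFrame(d), var_name='Dept', value_name='Count')
print(df)
#   Dept  Count
# 0   IT      1
# 1   IT     20
# 2   IT     35
# 3   IT     44
# 4   IT     51
# 5   IT   1000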

is it possible to do the equivalent of SQL nested requests in pandas dataframe?

Is it possible to do in a pandas DataFrame the equivalent of this SQL code?
DELETE FROM tableA WHERE id IN (SELECT id FROM tableB)
Don't know the exact structure of your DataFrames, but something like this should do it:
# Setup dummy data:
import pandas as pd
tableA = pd.DataFrame(data={"id":[1, 2, 3]})
tableB = pd.DataFrame(data={"id":[3, 4, 5]})
# Solution:
tableA = tableA[~tableA["id"].isin(tableB["id"])]
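An anti-join spelled with merge and indicator=True is another common way to express the NOT IN semantics (a sketch reusing the dummy frames above):
# how='left' keeps all of tableA; indicator=True marks where each row matched
merged = tableA.merge(tableB[['id']], on='id', how='left', indicator=True)
tableA = merged.loc[merged['_merge'] == 'left_only'].drop(columns='_merge')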
Yes, there is an equivalent. You can try this:
df2.drop(df2.loc[df2['id'].isin(df1['id'])].index)
For example:
df1 = pd.DataFrame({'id': [1,2,3,4], 'value': [2,3,4,5]})
df2 = pd.DataFrame({'id': [1,2,3,4, 5,6,7], 'col': [2,3,4,5,6,7,8]})
print(df2.drop(df2.loc[df2['id'].isin(df1['id'])].index))
output:
id col
4 5 6
5 6 7
6 7 8
I just took random example DataFrames. This example drops rows from df2 (which corresponds to tableA) using df1 (tableB).

pandas dataframe export row column value

I import data from Excel into pandas with read_clipboard.
import pandas as pd
df = pd.read_clipboard()
The column index holds the months (january, february, ..., december). The row index holds the product names (orange, banana, etc.). The cell values are the monthly sales.
How can I export a CSV of the following format?
month;product;sales
To make it more visual, I show the input in the first image and how the output should be in the second image.
You can also use the xlrd package (note: xlrd 2.x dropped .xlsx support, so this needs an older xlrd).
Sample Book1.xlsx:
january february march
Orange 4 2 4
banana 2 6 3
apple 5 1 7
sample code:
import xlrd

book = xlrd.open_workbook("Book1.xlsx")
print(book.sheet_names())
first_sheet = book.sheet_by_index(0)
header = first_sheet.row_values(0)   # ['', 'january', 'february', 'march']
print(first_sheet.nrows)
# Skip the header row, then pair every cell with its month column:
for i in range(1, first_sheet.nrows):
    row = first_sheet.row_values(i)
    for j in range(1, len(row)):
        print("{};{};{}".format(header[j], row[0], row[j]))
Result:
january;Orange;4.0
february;Orange;2.0
march;Orange;4.0
january;banana;2.0
february;banana;6.0
march;banana;3.0
january;apple;5.0
february;apple;1.0
march;apple;7.0
If that is the only case, this might solve the problem:
month = df1.columns.to_list() * len(df1.index)
product = []
sales = []
for x in range(len(df1.index)):
    product += [df1.index[x]] * len(df1.columns)
    sales += df1.iloc[x].values.tolist()
df2 = pd.DataFrame({'month': month, 'product': product, 'sales': sales})
But you should look for a smarter way if you have a larger DataFrame, like what @Jon Clements suggested in the comment.
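For reference, a reshape-based sketch in that spirit (assuming, as in the question, that the products are the row index and the months are the columns):
out = (df.stack()                          # Series indexed by (product, month)
         .rename_axis(['product', 'month'])
         .reset_index(name='sales')
         [['month', 'product', 'sales']])  # reorder to the requested layout
out.to_csv('name.csv', sep=';', index=False)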
I finally solved it thanks to your advice, using unstack:
df2 = df.transpose()
df3 = df2.unstack()
df3.to_csv('my/path/name.csv', sep=';')
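A side note (my observation, not from the original post): unstacking the transposed frame keys the result by (product, month); unstacking df directly yields (month, product), which matches the requested month;product;sales column order:
df.unstack().to_csv('my/path/name.csv', sep=';')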