I have a dataframe like this:
import pandas as pd
import numpy as np

data = {'UNIT': ['UNIT_1', 'UNIT_2', 'UNIT_3', 'UNIT_4'],
        'Name_1': ['werner', 'otto', 'karl', 'fritz'],
        'Name_2': ['ottilie', 'anna', 'jasmin', ''],
        'Name_3': ['bello', 'kitti', '', '']}
df = pd.DataFrame(data)
df.replace('', np.nan, inplace=True)
display(df)
which looks like this:
     UNIT  Name_1   Name_2 Name_3
0  UNIT_1  werner  ottilie  bello
1  UNIT_2    otto     anna  kitti
2  UNIT_3    karl   jasmin    NaN
3  UNIT_4   fritz      NaN    NaN
The result that I want is this:
     value    UNIT
0   werner  UNIT_1
1  ottilie  UNIT_1
2    bello  UNIT_1
3     otto  UNIT_2
4     anna  UNIT_2
5    kitti  UNIT_2
6     karl  UNIT_3
7   jasmin  UNIT_3
8    fritz  UNIT_4
The code I have so far looks like this:
for index, row in df.iterrows():
    row_transposed = row.T
    row_transposed.dropna(inplace=True)
    df_row_transposed = pd.DataFrame(row_transposed)
    df_row_transposed_head = df_row_transposed.head(1)
    # display(df_row_transposed)
    # display(row_transposed_head)
    hr_unit = df_row_transposed_head.iloc[0]
    add_unit = hr_unit[index]
    for index, row in df_row_transposed.iterrows():
        df_row_transposed["UNIT"] = add_unit
        # row_transposed = row_transposed.iloc[index:, :]
    display(df_row_transposed)
which already creates this:
but now I am stuck...
Any help is very much appreciated.
Try df.melt. It will help you unstack the columns.
ddf = df.melt(id_vars='UNIT').sort_values(by='UNIT')
new_df = ddf[["value", "UNIT"]]
new_df = new_df.dropna().reset_index(drop=True)
new_df
Out[163]:
value UNIT
0 werner UNIT_1
1 ottilie UNIT_1
2 bello UNIT_1
3 otto UNIT_2
4 anna UNIT_2
5 kitti UNIT_2
6 karl UNIT_3
7 jasmin UNIT_3
8 fritz UNIT_4
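For reference, the same steps can be chained into a single expression; this is just a compact rewrite of the calls above, not a different technique:
new_df = (df.melt(id_vars='UNIT')
            .sort_values(by='UNIT')[['value', 'UNIT']]
            .dropna()
            .reset_index(drop=True))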
I'm trying to drop rows with missing values in any of several dataframes.
They all have the same number of rows, so I tried this:
model_data_with_NA = pd.concat([other_df,
                                standardized_numerical_data,
                                encode_categorical_data], axis=1)
ok_rows = ~(model_data_with_NA.isna().all(axis=1))
model_data = model_data_with_NA.dropna()
assert(sum(ok_rows) == len(model_data))
False!
As a Python newbie, I wonder why this doesn't work. Also, would it be better to use hierarchical indexing? Then I could extract the original columns from model_data.
In Short
I believe the all in ~(model_data_with_NA.isna().all(axis=1)) should be replaced with any.
The reason is that all checks whether every value in a row is missing, while any checks whether at least one value is missing.
Full Example
import pandas as pd
import numpy as np

df1 = pd.DataFrame({'a': [1, 2, 3]})
df2 = pd.DataFrame({'b': [1, np.nan]})
df3 = pd.DataFrame({'c': [1, 2, np.nan]})
model_data_with_na = pd.concat([df1, df2, df3], axis=1)

# any(axis=1) flags rows with at least one NaN, which is exactly what dropna() removes
ok_rows = ~(model_data_with_na.isna().any(axis=1))
model_data = model_data_with_na.dropna()
assert(sum(ok_rows) == len(model_data))
model_data_with_na
   a    b    c
0  1  1.0  1.0
1  2  NaN  2.0
2  3  NaN  NaN
model_data
   a    b    c
0  1  1.0  1.0
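As for the hierarchical-indexing part of the question: pd.concat accepts a keys argument that builds a MultiIndex on the columns, one level per source frame, so the original columns can be extracted afterwards. A minimal sketch, reusing the frames above (the key labels here are made up for illustration):
# keys= labels each source frame, producing hierarchical (MultiIndex) columns
model_data = pd.concat([df1, df2, df3], axis=1,
                       keys=['first', 'second', 'third']).dropna()
print(model_data['first'])  # just the columns that came from df1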
I have a csv file like this:
1,100
2,200
3,300
...
many datas
...
9934,321
9935,111
2021-01-01, jane doe, 321
2021-01-10, john doe, 211
2021-01-30, jack doe, 911
...
many datas
...
2021-11-30, jick doe, 921
If I encounter a CSV file like the one above,
how can I split it into two dataframes, without a loop or extra computation?
I see it like this:
import pandas as pd

data = 'file.csv'
df = pd.read_csv(data, names=['a', 'b', 'c'])  # I have to name the columns
df_1 = df[~df['c'].isnull()]  # rows that have a third column
df_2 = df[df['c'].isnull()]   # rows that have only two columns
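Note that df_2 still carries the all-NaN column c; if you don't want it, you can drop it (a small optional addition):
df_2 = df_2.drop(columns='c')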
A second idea was to first find the index of the row where the data switches from two to three columns.
import pandas as pd
import numpy as np

data = 'stack.csv'
df = pd.read_csv(data, names=['a', 'b', 'c'])
rows = df['c'].index[df['c'].apply(np.isnan)]
df_1 = pd.read_csv(data, names=['a', 'b', 'c'], skiprows=rows[-1] + 1)
df_2 = pd.read_csv(data, names=['a', 'b'], nrows=rows[-1] + 1)
Note that this assumes all two-column rows come before the three-column rows. I think you can easily modify the code if the file layout changes.
Here is the reason why I named the columns: link
I have a data set like this: {'IT': [1, 20, 35, 44, 51, ..., 1000]}
I want to convert this into a python/pandas dataframe, with the output in the format below. How can I achieve this?
Dept Count
IT 1
IT 20
IT 35
IT 44
IT 51
.. .
.. .
.. .
IT 1000
I can write it the way below, but this is not an efficient way for huge data.
import pandas as pd

data = [['IT', 1], ['IT', 2], ['IT', 3]]
df = pd.DataFrame(data, columns=['Dept', 'Count'])
print(df)
No need for a list comprehension, since pandas will automatically broadcast the scalar 'IT' into every row.
import pandas as pd
d = {'IT':[1,20,35,44,51,1000]}
df = pd.DataFrame({'dept': 'IT', 'count': d['IT']})
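With the sample data this gives:
print(df)
  dept  count
0   IT      1
1   IT     20
2   IT     35
3   IT     44
4   IT     51
5   IT   1000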
Use a list comprehension to build tuples and pass them to the DataFrame constructor:
d = {'IT': [1, 20, 35, 44, 51], 'NEW': [1000]}
data = [(k, x) for k, v in d.items() for x in v]
df = pd.DataFrame(data, columns=['Dept', 'Count'])
print(df)
Dept Count
0 IT 1
1 IT 20
2 IT 35
3 IT 44
4 IT 51
5 NEW 1000
You can use melt:
import pandas as pd
d = {'IT': [10]*100000}
df = pd.DataFrame(d)
df = pd.melt(df, var_name='Dept', value_name='Count')
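With this toy frame the melted result has 100,000 rows, all with the same Dept label; the first few look like:
print(df.head(3))
  Dept  Count
0   IT     10
1   IT     10
2   IT     10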
Is it possible to do in a pandas DataFrame the equivalent of this SQL statement?
DELETE FROM tableA WHERE id IN (SELECT id FROM tableB)
Don't know the exact structure of your DataFrames, but something like this should do it:
# Setup dummy data:
import pandas as pd
tableA = pd.DataFrame(data={"id":[1, 2, 3]})
tableB = pd.DataFrame(data={"id":[3, 4, 5]})
# Solution:
tableA = tableA[~tableA["id"].isin(tableB["id"])]
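For the dummy data above, only the ids that do not appear in tableB survive:
print(tableA)
   id
0   1
1   2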
Yes there is an equivalent. You can try this:
df2.drop(df2.loc[df2['id'].isin(df1['id'])].index)
For example:
df1 = pd.DataFrame({'id': [1,2,3,4], 'value': [2,3,4,5]})
df2 = pd.DataFrame({'id': [1,2,3,4, 5,6,7], 'col': [2,3,4,5,6,7,8]})
print(df2.drop(df2.loc[df2['id'].isin(df1['id'])].index))
output:
id col
4 5 6
5 6 7
6 7 8
I just took random example dataframes. This example drops rows from df2 (which plays the role of tableA) using df1 (as tableB).
I import data from Excel into python pandas with read_clipboard.
import pandas as pd
df = pd.read_clipboard()
The column index is the months (january, february, ..., december), the row index is the product names (orange, banana, etc.), and the cell values are the monthly sales.
How can I export a CSV in the following format?
month;product;sales
To make it more visual, I show the input in the first image and how the output should be in the second image.
You can also use the xlrd package (note that xlrd 2.x dropped .xlsx support, so this needs xlrd < 2.0 or an .xls file).
Sample Book1.xlsx:
january february march
Orange 4 2 4
banana 2 6 3
apple 5 1 7
Sample code:
import xlrd

book = xlrd.open_workbook("Book1.xlsx")
print(book.sheet_names())
first_sheet = book.sheet_by_index(0)
header = first_sheet.row_values(0)  # ['', 'january', 'february', 'march']
print(first_sheet.nrows)

for i in range(1, first_sheet.nrows):    # skip the header row
    row = first_sheet.row_values(i)      # e.g. ['Orange', 4.0, 2.0, 4.0]
    for j in range(1, len(row)):
        # pair each value with its own month from the header, not a fixed one
        print("{};{};{}".format(header[j], row[0], row[j]))
Result:
january;Orange;4.0
february;Orange;2.0
march;Orange;4.0
january;banana;2.0
february;banana;6.0
march;banana;3.0
january;apple;5.0
february;apple;1.0
march;apple;7.0
If that is the only case you need to handle, the following might solve the problem:
# repeat the month list once per product, and each product once per month
month = df1.columns.to_list() * len(df1.index)
product = []
sales = []
for x in range(len(df1.index)):
    product += [df1.index[x]] * len(df1.columns)
    sales += df1.iloc[x].values.tolist()
df2 = pd.DataFrame({'month': month, 'product': product, 'sales': sales})
But you should look for a smarter way if you have a larger DataFrame, like what @Jon Clements suggested in the comments.
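For reference, a pandas-only sketch of the same reshape (not necessarily what was suggested in the comments; it assumes df1 holds products in the index and months in the columns, and 'output.csv' is a placeholder path):
df2 = (df1.rename_axis('product')
          .reset_index()
          .melt(id_vars='product', var_name='month', value_name='sales')
          [['month', 'product', 'sales']])
df2.to_csv('output.csv', sep=';', index=False)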
I finally solved it thanks to your advice, using unstack:
df2 = df.transpose()
df3 = df2.unstack()
df3.to_csv('my/path/name.csv', sep=';')
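One detail worth noting (a small sketch, assuming df has products as the row index and months as the columns): unstacking the transposed frame yields a Series keyed by (product, month), while unstacking df directly yields (month, product), which matches the requested month;product;sales order:
df3 = df.unstack()  # MultiIndex: (month, product)
df3.to_csv('my/path/name.csv', sep=';')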