.pkl data as dictionary to dataframe

I'm trying to convert a dictionary that I have obtained from a pickle file to dataframe. A sample of it is below:
import pickle as pk

with open('sample.pkl', 'rb') as f:
    test = pk.load(f)
Output: {0: 23, 1: 34, 2: 65, 3: 12, 4: 87, 5: 55}
I want this dictionary as a dataframe, where my keys and values are under two separate columns of the df, ['keys', 'value']. For that I have tried the following:
Method 1:
df = pd.DataFrame(test, columns=['keys', 'value'])
Output: empty dataframe
Index: []
Method 2:
df = pd.DataFrame.from_dict(test, orient='index', columns=['value'])
Output obtained:
   value
0     66
1     40
2     81
3     39
4     57
5     82
Output required:
keys value
0 66
1 40
2 81
3 39
4 57
5 82
I want to remove the default index column and save my keys under a column named 'keys'. Later on, I will append this dataframe to an existing CSV.
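For reference, a minimal sketch of one way to get that shape, assuming the dictionary loaded from the pickle is the test variable above and that 'existing.csv' is only a placeholder name for the target file:

import pickle as pk
import pandas as pd

with open('sample.pkl', 'rb') as f:
    test = pk.load(f)

# Turn the (key, value) pairs into two columns named 'keys' and 'value'
df = pd.DataFrame(list(test.items()), columns=['keys', 'value'])

# Append to the existing CSV without writing the index or repeating the header
df.to_csv('existing.csv', mode='a', header=False, index=False)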

Fill NaN values of a single column where matching values are found in another dataframe [duplicate]

I have two data frames: 'df' and 'dfcup'.
'df' has NaN values in some rows for the variable "Cup".
My second data frame 'dfcup' is a .groupby() and .sort() call on 'df' that produces a 'count' of the number of times each "Height", "Weight" and "Cup" combination appears:
dfcup = df.groupby(['Height','Weight','Cup']).size().reset_index().rename(columns={0:'count'})
The goal is to replace NaN values in the df['Cup'] column with the first matching dfcup['Cup'] based on "Height" and "Weight".
For example, for a given "Height" and "Weight" of "65.0" and "130.0", replace NaN values in df['Cup'] with "B" (as found in dfcup['Cup']).
Is this possible with a "for" loop, or something similar?
Example
I made a minimal and reproducible example, because you didn't provide minimal and reproducible code. This example is different from yours, so apply only the logic; you don't need to build df_cup to solve it.
import pandas as pd

data = {'c1': {0: 64, 1: 63, 2: 64, 3: 63, 4: 63, 5: 64, 6: 63, 7: 64},
        'c2': {0: 'A', 1: 'D', 2: 'B', 3: None, 4: 'C', 5: None, 6: 'C', 7: 'B'}}
df = pd.DataFrame(data)
df
c1 c2
0 64 A
1 63 D
2 64 B
3 63 None
4 63 C
5 64 None
6 63 C
7 64 B
When c1 is 64, B is the most frequent value in c2; when c1 is 63, C is the most frequent.
Code
df.assign(c2=df.groupby('c1')['c2'].transform(lambda x: x.fillna(x.mode()[0])))
output:
c1 c2
0 64 A
1 63 D
2 64 B
3 63 C <-- None in group 63 filled with C (the group's mode)
4 63 C
5 64 B <-- None in group 64 filled with B (the group's mode)
6 63 C
7 64 B
Use the same logic to solve your problem; a sketch applied to your own columns is below.
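For reference, a sketch of how that logic might look on the original frame, assuming df has the 'Height', 'Weight' and 'Cup' columns described above (groups whose 'Cup' values are all missing are left untouched):

def fill_with_mode(s):
    mode = s.mode()  # most frequent non-missing 'Cup' value within the group
    return s.fillna(mode.iloc[0]) if not mode.empty else s

df['Cup'] = df.groupby(['Height', 'Weight'])['Cup'].transform(fill_with_mode)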

Concatenate/Append many dataframes in pandas

I have a list of dataframes df1 to df20 that are being created in a loop, and I need to concatenate all of them in one go. The number of dataframes is dynamic and can be anywhere between 1 and 20, depending on the loop that generates them in my code.
So, I was trying to create an empty list first, add these dataframe names to it (in a loop from 1 to 20, as an example) and use this list in pd.concat(df_list) as below:
df_list = []
for i in range(1, 21):
    df_list.append(f'df{i}')
pd.concat(df_list)
The above code creates a list of dataframe names, but as strings with quotes (shown below), so I'm unable to concatenate the dataframes using pd.concat(df_list) because it treats the dataframe names as string elements:
print(df_list)
['df1', 'df2', 'df3', 'df4', 'df5', 'df6', 'df7', 'df8', 'df9', 'df10', 'df11', 'df12', 'df13', 'df14', 'df15', 'df16', 'df17', 'df18','df19','df20']
I'd appreciate any help with concatenating these dataframes.
I think if I could add the dataframe names without quotes, like df_list=[df1, df2, ...], then pd.concat would work; otherwise, please let me know if there is a better alternative to get this done. Thanks!
UPDATE
As per the commented suggestions, I've created a simple loop to create multiple dataframes, and I tried to append the "names of these dataframes" to an empty list inside the same loop where the dataframes are created. But the output is not what I am expecting.
mylist = []
for i in range(1, 4):
    globals()[f"df{i}"] = pd.DataFrame(np.random.randint(99, size=(3, 3)), columns=['AA', 'BB', 'CC'])
    mylist.append(globals()[f"df{i}"])
The above code creates 3 dataframes (df1, df2 and df3), and the list does get appended to, but with the contents of the dataframes, as shown below:
print(mylist)
[ AA BB CC
0 57 92 50
1 33 47 28
2 82 77 46, AA BB CC
0 18 8 75
1 1 15 52
2 4 69 38, AA BB CC
0 19 24 31
1 24 52 62
2 50 8 63]
But my desired output is not the contents of the dataframes; it is the names of the dataframes themselves, like below:
print(mylist)
[df1,df2,df3]
I'd appreciate it if anyone can show me how to get this done; I think there must be some simple way to do it.
That's because you're effectively appending strings to your list. If you have named variables, df1 to df20, you can access them by using locals() (or globals(), depending on where your named variables live and whether you are concatenating the dataframes inside a function or not). Here is an example:
df1 = 0
df2 = 1
df3 = 2
df_list = []
for i in range(1, 4):
    df_list.append(locals()[f'df{i}'])
>>> df_list
[0, 1, 2]
EDIT: I think what you want to do is the following:
import pandas as pd
import numpy as np

mylist = []
for x in range(1, 4):
    df = pd.DataFrame(np.random.randint(99, size=(3, 3)), columns=['AA', 'BB', 'CC'])
    mylist.append(df)

dfs = pd.concat(mylist)
Note that printing mylist is never going to show you something along the lines of mylist = [df1, df2, df3], even if you hardcode that; it will print the entire content of all the variables inside your list. If you don't know in advance how many dataframes you're going to concatenate, just implement a while loop that breaks when you want to stop creating dataframes, as in the sketch below.
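For example, a rough sketch of that while-loop idea (the stopping condition here is arbitrary and only for illustration):

import numpy as np
import pandas as pd

mylist = []
while True:
    df = pd.DataFrame(np.random.randint(99, size=(3, 3)), columns=['AA', 'BB', 'CC'])
    mylist.append(df)
    if len(mylist) >= 20:  # replace with whatever condition should end your loop
        break

dfs = pd.concat(mylist, ignore_index=True)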
Consider another example
# create a list of 100 dataframes (df0 to df99)
mylist = []
for x in range(100):
    df = pd.DataFrame(np.random.randint(99, size=(3, 3)), columns=['AA', 'BB', 'CC'])
    mylist.append(df)
concat_range = input("Range of dataframes to concatenate (0-100): ")
i, j = concat_range.split(" ")
dfs = pd.concat(mylist[int(i) : int(j)])
# further operations on dfs
Now, let's say I am the user and I want to concatenate df5 to df32.
>>> Range of dataframes to concatenate (0-100): 5 32
>>> dfs
AA BB CC
0 28 37 36
1 34 18 14
2 39 41 97
0 44 66 76
1 57 16 3
.. .. .. ..
1 43 87 74
2 67 70 73
0 40 60 57
1 23 63 70
2 96 24 31
[81 rows x 3 columns]

Getting a variable number of pandas rows w.r.t. a dictionary lookup

In this sample dataframe df:
import pandas as pd
import numpy as np
import random, string
max_rows = {'A': 3, 'B': 2, 'D': 4} # max number of rows to be extracted
data_size = 1000
df = pd.DataFrame({'symbol': pd.Series(random.choice(string.ascii_uppercase) for _ in range(data_size)),
                   'qty': np.random.randn(data_size)}).sort_values('symbol')
How do I get a dataframe with a variable number of rows per symbol, as specified by the dictionary?
I tried using [df.groupby('symbol').head(i) for i in df.symbol.map(max_rows)], but it gives a RuntimeWarning and looks very incorrect.
You can use concat with a list comprehension:
print (pd.concat([df.loc[df["symbol"].eq(k)].head(v) for k,v in max_rows.items()]))
symbol qty
640 A -0.725947
22 A -1.361063
190 A -0.596261
451 B -0.992223
489 B -2.014979
593 D 1.581863
600 D -2.162044
793 D -1.162758
738 D 0.345683
Adding another method using groupby+cumcount and df.query
df.assign(v=df.groupby("symbol").cumcount()+1,k=df['symbol'].map(max_rows)).query("v<=k")
Or the same logic without assigning extra columns (thanks @jezrael):
df[df.groupby("symbol").cumcount()+1 <= df['symbol'].map(max_rows)]
symbol qty
882 A -0.249236
27 A 0.625584
122 A -1.154539
229 B -1.269212
55 B 1.403455
457 D -2.592831
449 D -0.433731
634 D 0.099493
734 D -1.551012

Condition in Pandas

I have a very peculiar problem in Pandas: one condition works but the other does not. You may download the linked file to test my code. Thanks!
I have a file (stars.txt) that I read in with Pandas. I would like to create two groups: (1) with Log_G < 4.0 and (2) with Log_G > 4.0. In my code (see below) I can successfully get the rows for group (1):
Kepler_ID RA Dec Teff Log_G g H
3 2305372 19 27 57.679 +37 40 21.90 5664 3.974 14.341 12.201
14 2708156 19 21 08.906 +37 56 11.44 11061 3.717 10.672 10.525
19 2997455 19 32 31.296 +38 07 40.04 4795 3.167 14.694 11.500
34 3352751 19 36 17.249 +38 25 36.91 7909 3.791 13.541 12.304
36 3440230 19 21 53.100 +38 31 42.82 7869 3.657 13.706 12.486
But for some reason I cannot get group (2). The code returns the following as the 'error':
<class 'pandas.core.frame.DataFrame'>
Int64Index: 90 entries, 0 to 108
Data columns (total 7 columns):
Kepler_ID 90 non-null values
RA 90 non-null values
Dec 90 non-null values
Teff 90 non-null values
Log_G 90 non-null values
g 90 non-null values
H 90 non-null values
dtypes: float64(4), int64(1), object(2)
Here's my code:
#------------------------------------------
# IMPORT STATEMENTS
#------------------------------------------
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
#------------------------------------------
# READ FILE AND ASSOCIATE COMPONENTS
#------------------------------------------
star_file = 'stars.txt'
header_row = ['Kepler_ID', 'RA','Dec','Teff', 'Log_G', 'g', 'H']
df = pd.read_csv(star_file, names=header_row, skiprows=2)
#------------------------------------------
# ASSOCIATE VARIABLES
#------------------------------------------
Kepler_ID = df['Kepler_ID']
#RA = df['RA']
#Dec = df['Dec']
Teff = df['Teff']
Log_G = df['Log_G']
g = df['g']
H = df['H']
#------------------------------------------
# SUBSTITUTE MISSING DATA WITH NAN
#------------------------------------------
df = df.replace('', np.nan)
#------------------------------------------
# CHANGE DATA TYPE OF THE REST OF DATA TO FLOAT
#------------------------------------------
df[['Teff', 'Log_G', 'g', 'H']] = df[['Teff', 'Log_G', 'g', 'H']].astype(float)
#------------------------------------------
# SORTING SPECTRA TYPES FOR GIANTS
#------------------------------------------
# FIND GIANTS IN THE SAMPLE
giants = df[(df['Log_G'] < 4.)]
#print(giants)
# FIND DWARFS IN THE SAMPLE
dwarfs = df[(df['Log_G'] > 4.)]
print(dwarfs)
This is not an error. You are seeing a summarized view of the DataFrame:
In [11]: df = pd.DataFrame([[2, 1], [3, 4]])
In [12]: df
Out[12]:
0 1
0 2 1
1 3 4
In [13]: df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 2 entries, 0 to 1
Data columns (total 2 columns):
0 2 non-null values
1 2 non-null values
dtypes: int64(2)
What is displayed is decided by several display options, for example max_rows:
In [14]: pd.options.display.max_rows
Out[14]: 60
In [15]: pd.options.display.max_rows = 120
In 0.13, this behaviour changed, so you will see the first max_rows rows followed by an ellipsis (...).
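If you want to print the whole of dwarfs rather than the summarized view, one option (a sketch; note that setting the option globally affects all subsequent output) is to lift the row limit:

import pandas as pd

# None removes the row limit, so full frames are printed instead of the summary view
pd.set_option('display.max_rows', None)
print(dwarfs)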

drop a single tuple from a multi tuple column

I have the following dataframe:
<bound method DataFrame.info of <class 'pandas.core.frame.DataFrame'>
MultiIndex: 369416 entries, (datetime.datetime(2008, 1, 2, 16, 0), 'ABC') to (datetime.datetime(2010, 12, 31, 16, 0), 'XYZ')
Data columns:
b_val 369416 non-null values
dtypes: float64(1)>
From this, I want a dataframe that has dates as the indexes and 'ABC' to 'XYZ' as column names with the values as the values under the column 'b_val'. I tried to do:
new_data = new_data.unstack()
But this gives me:
<bound method DataFrame.info of <class 'pandas.core.frame.DataFrame'>
Index: 757 entries, 2008-01-02 16:00:00 to 2010-12-31 16:00:00
Columns: 488 entries, ('b_val', 'ABC') to ('b_val', 'XYZ')
dtypes: float64(488)>
Is there another way to transform this, or is there a way to drop 'b_val' from each of the column names?
I think unstack is the correct way to do what you've done.
You could drop the first level from the column names (a MultiIndex) using droplevel:
df.columns = df.columns.droplevel(0)
Here's an example:
df = pd.DataFrame([[1, 'a', 22], [1, 'b', 27], [2, 'a', 35], [2, 'b', 56]],
                  columns=['date', 'name', 'value']).set_index(['date', 'name'])
df1 = df.unstack()
In [3]: df1
Out[3]:
value
name a b
date
1 22 27
2 35 56
In [4]: df1.columns = df1.columns.droplevel(0)
In [5]: df1
Out[5]:
name a b
date
1 22 27
2 35 56
However, a cleaner option is just to unstack the column (the series):
In [6]: df.value.unstack()
Out[6]:
name a b
date
1 22 27
2 35 56