How to apply str.split() on pandas column? - pandas

Using Simple Data:
df = pd.DataFrame({'ids': [0,1,2], 'value': ['2 4 10 0 14', '5 91 19 20 0', '1 1 1 2 44']})
I need to convert the column to array, so I use:
df.iloc[:,-1] = df.iloc[:,-1].apply(lambda x: str(x).split())
X = df.iloc[:, 1:]
X = np.array(X.values)
but the problem is the data is being nested and I just need a matrix (3,5). How to make this properly and fast for large data (avoid looping)?

As noted in the comments by @anky and @ScottBoston, you can use the string method split with the expand parameter and then convert the result to a NumPy array:
df.iloc[:, 1].str.split(expand=True).values
array([['2', '4', '10', '0', '14'],
       ['5', '91', '19', '20', '0'],
       ['1', '1', '1', '2', '44']], dtype=object)
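The array above still holds strings. If a numeric (3, 5) matrix is what is needed, one option is to cast after splitting. This is a minimal sketch, assuming every row has the same number of tokens and every token parses as an integer:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'ids': [0, 1, 2],
                   'value': ['2 4 10 0 14', '5 91 19 20 0', '1 1 1 2 44']})

# Split on whitespace into one column per token, then cast to integers.
# astype(int) raises if a row is short (missing cell) or a token is not numeric.
X = df['value'].str.split(expand=True).astype(int).to_numpy()

print(X.shape)
print(X)
```

If some rows may be ragged or contain non-numeric tokens, the cast would need to be replaced with something more forgiving.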

Related

Assign Random Number between two value conditionally

I have a dataframe:
df = pd.DataFrame({
'Prod': ['abc', 'qrt', 'xyz', 'xam', 'asc', 'yat'],
'Line': ['Revenues','EBT', 'Expenses', 'Revenues', 'EBT', 'Expenses'],
'AGE': ['35-44', '20-34', '35-44', '35-44', '45-70', '35-44']})
I wish to replace the values in the AGE column with integers drawn from the corresponding range. So, for example, I wish to replace each value with age range '35-44' by a random integer between 35 and 44.
I tried:
df.loc[df["AGE"]== '35-44', 'AGE'] = random.randint(35, 44)
But it picks the same value for each row. I would like it to randomly pick a different value for each row.
I get:
df = pd.DataFrame({
'Prod': ['abc', 'qrt', 'xyz', 'xam', 'asc', 'yat'],
'Line': ['Revenues','EBT', 'Expenses', 'Revenues', 'EBT', 'Expenses'],
'AGE': ['38', '20-34', '38', '38', '45-70', '38']})
But I would like to get something like the following. I don't much care about how the values are distributed as long as they are in the range that I assign
df = pd.DataFrame({
'Prod': ['abc', 'qrt', 'xyz', 'xam', 'asc', 'yat'],
'Line': ['Revenues','EBT', 'Expenses', 'Revenues', 'EBT', 'Expenses'],
'AGE': ['36', '20-34', '39', '38', '45-70', '45']})
The code
random.randint(35, 44)
Produces a single random value making the statement analogous to:
df.loc[df["AGE"]== '35-44', 'AGE'] = 38 # some constant
We need a collection of values that is the same length as the values to fill. We can use np.random.randint instead:
import numpy as np
m = df["AGE"] == '35-44'
df.loc[m, 'AGE'] = np.random.randint(35, 44, m.sum())
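(Series.sum is used to "count" the number of True values in the Series, since True is 1 and False is 0. Note that np.random.randint excludes the upper bound, so this call draws from 35 to 43; pass 45 as the upper bound if 44 should be possible.)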
df:
  Prod      Line    AGE
0  abc  Revenues     40
1  qrt       EBT  20-34
2  xyz  Expenses     41
3  xam  Revenues     35
4  asc       EBT  45-70
5  yat  Expenses     36
*Reproducible with np.random.seed(26)
Naturally, using the filter on both sides of the expression with apply would also work:
import random
m = df["AGE"] == '35-44'
df.loc[m, 'AGE'] = df.loc[m, 'AGE'].apply(lambda _: random.randint(35, 44))
df:
  Prod      Line    AGE
0  abc  Revenues     36
1  qrt       EBT  20-34
2  xyz  Expenses     37
3  xam  Revenues     43
4  asc       EBT  45-70
5  yat  Expenses     44
*Reproducible with random.seed(28)
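The same idea can be extended to every range at once rather than only '35-44'. This is a minimal sketch, assuming each AGE value has the form 'low-high' and that both bounds should be inclusive; the names rng, bounds, low and high are illustrative and not part of the original answer:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Prod': ['abc', 'qrt', 'xyz', 'xam', 'asc', 'yat'],
    'Line': ['Revenues', 'EBT', 'Expenses', 'Revenues', 'EBT', 'Expenses'],
    'AGE': ['35-44', '20-34', '35-44', '35-44', '45-70', '35-44']})

rng = np.random.default_rng(0)  # seeded only so that reruns give the same draw

# Split 'low-high' into two integer columns (assumes that exact format).
bounds = df['AGE'].str.split('-', expand=True).astype(int)
low, high = bounds[0].to_numpy(), bounds[1].to_numpy()

# Generator.integers accepts array bounds and draws one value per row;
# endpoint=True makes the upper bound inclusive.
df['AGE'] = rng.integers(low, high, endpoint=True)
print(df)
```

Because the bounds are passed as arrays, each row gets its own independent draw without an explicit loop.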

convert pandas dataframe to list and nest a dict?

I have a list:
l = [{'level': '1', 'rows': 2}, {'level': '2', 'rows': 3}]
I can convert it to a DataFrame, but how do I convert it back?
frame = pd.DataFrame(l)
We have to_dict
frame.to_dict('records')
Out[67]: [{'level': '1', 'rows': 2}, {'level': '2', 'rows': 3}]
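A quick round-trip check, with the orient name spelled out in full (abbreviated orient names are not accepted by recent pandas versions). The variable name records is illustrative:

```python
import pandas as pd

l = [{'level': '1', 'rows': 2}, {'level': '2', 'rows': 3}]
frame = pd.DataFrame(l)

# orient='records' returns one dict per row, keyed by column name.
records = frame.to_dict(orient='records')

# The round trip should give back a list equal to the original.
print(records == l)
```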

Combine index header row and column header row in Pandas

I create a dataframe and export to an html table. However the headers are off as below
How can I combine the index name row, and the column name row?
I want the table header to look like this:
but it currently exports to html like this:
I create the dataframe as below (example):
data = [{'Name': 'A', 'status': 'ok', 'host': '1', 'time1': '2020-01-06 06:31:06', 'time2': '2020-02-06 21:10:00'}, {'Name': 'A', 'status': 'ok', 'host': '2', 'time1': '2020-01-06 06:31:06', 'time2': '-'}, {'Name': 'B', 'status': 'Alert', 'host': '1', 'time1': '2020-01-06 10:31:06', 'time2': '2020-02-06 21:10:00'}, {'Name': 'B', 'status': 'ok', 'host': '2', 'time1': '2020-01-06 10:31:06', 'time2': '2020-02-06 21:10:00'},{'Name': 'B', 'status': 'ok', 'host': '4', 'time1': '2020-01-06 10:31:06', 'time2': '2020-02-06 21:10:00'},{'Name': 'C', 'status': 'Alert', 'host': '2', 'time1': '2020-01-06 10:31:06', 'time2': '2020-02-06 21:10:00'},{'Name': 'C', 'status': 'ok', 'host': '3', 'time1': '2020-01-06 10:31:06', 'time2': '2020-02-06 21:10:00'},{'Name': 'C', 'status': 'ok', 'host': '4', 'time1': '-', 'time2': '-'}]
df = pandas.DataFrame(data)
df.set_index(['Name', 'status', 'host'], inplace=True)
html_body = df.to_html(bold_rows=False)
The index is set to have hierarchical rows, for easier reading in an html table:
print(df)
                                time1                time2
Name status host
A    ok     1     2020-01-06 06:31:06  2020-02-06 21:10:00
            2     2020-01-06 06:31:06                    -
B    Alert  1     2020-01-06 10:31:06  2020-02-06 21:10:00
     ok     2     2020-01-06 10:31:06  2020-02-06 21:10:00
            4     2020-01-06 10:31:06  2020-02-06 21:10:00
C    Alert  2     2020-01-06 10:31:06  2020-02-06 21:10:00
     ok     3     2020-01-06 10:31:06  2020-02-06 21:10:00
            4                       -                    -
The only solution that I've got working is to set every column to index.
This doesn't seem practical, though, and it leaves an empty row that must be removed manually:
Setup
import pandas as pd
from IPython.display import HTML
l0 = ('Foo', 'Bar')
l1 = ('One', 'Two')
ix = pd.MultiIndex.from_product([l0, l1], names=('L0', 'L1'))
df = pd.DataFrame(1, ix, [*'WXYZ'])
HTML(df.to_html())
BeautifulSoup
Hack the HTML result from df.to_html(header=False). Pluck out the empty cells in the table head and drop in the column names.
from bs4 import BeautifulSoup
html_doc = df.to_html(header=False)
soup = BeautifulSoup(html_doc, 'html.parser')
empty_cols = soup.find('thead').find_all(lambda tag: not tag.contents)
for tag, col in zip(empty_cols, df):
    tag.string = col
HTML(soup.decode_contents())
If you want to use a Dataframe Styler to perform a lot of wonderful formatting on your table, the elements, and the contents, then you might need a slight change to piRSquared's answer, as I did.
before transformation
style.to_html() added non-breaking spaces which made tag.contents always return true, and thus yielded no change to the table. I modified the lambda to account for this, which revealed another issue.
lambda tag: (not tag.contents) or '\xa0' in tag.contents
Cells were copied strangely
Styler.to_html() lacks the header kwarg - I am guessing this is the source of the issue. I took a slightly different approach - Move the second row headers into the first row, and then destroy the second header row.
It seems pretty generic and reusable for any multi-indexed dataframe.
df_styler = summary_df.style
# Use the df_styler to change display format, color, alignment, etc.
raw_html = df_styler.to_html()
soup = BeautifulSoup(raw_html,'html.parser')
head = soup.find('thead')
trs = head.find_all('tr')
ths0 = trs[0].find_all(lambda tag: (not tag.contents) or '\xa0' in tag.contents)
# Non-blank cells of the second header row (the index names)
ths1 = trs[1].find_all(lambda tag: tag.contents and '\xa0' not in tag.contents)
for blank, filled in zip(ths0, ths1):
    blank.replace_with(filled)
trs[1].decompose()
final_html_str = soup.decode_contents()
Success - two header rows condensed into one
Big Thanks to piRSquared for the starting point of Beautiful soup!

Conditional grouping in pandas and transpose

With an input dataframe framed out of a given CSV, I need to transpose the data based on certain conditions. The groupby should be applied based on Key value.
For any value in the same 'Key' group, if the 'Type' is "T", these values should be written on "T" columns labelled as T1, T2, T3...and so on.
For any value in the same 'Key' group, if the 'Type' is "P" and 'Code' ends with "00" these values should be written on "U" columns labelled as U1, U2, U3...and so on.
For any value in the same 'Key' group, if the 'Type' is "P" and 'Code' doesn't end with "00" these values should be written on "P" columns labelled as P1, P2, P3...and so on.
There might be n number of values of type T & P for any Key value and the output columns for T & P should be updated accordingly
Input Dataframe:
df = pd.DataFrame({'Key': ['1', '1', '1', '1', '1', '2', '2', '2', '2', '2'],
'Value': ['T101', 'T102', 'P101', 'P102', 'P103', 'T201', 'T202', 'P201', 'P202', 'P203'],
'Type': ['T', 'T', 'P', 'P', 'P', 'T', 'T', 'P', 'P', 'P'],
'Code': ['0', '0', 'ABC00', 'TWY01', 'JTH02', '0', '0', 'OUJ00', 'LKE00', 'WDF45']
})
Expected Dataframe:
Can anyone suggest an effective solution for this case?
Here's a possible solution using pivot.
import pandas as pd
df = pd.DataFrame({'Key': ['1', '1', '1', '1', '1', '2', '2', '2', '2', '2'],
'Value': ['T101', 'T102', 'P101', 'P102', 'P103', 'T201', 'T202', 'P201', 'P202', 'P203'],
'Type': ['T', 'T', 'P', 'P', 'P', 'T', 'T', 'P', 'P', 'P'],
'Code': ['0', '0', 'ABC00', 'TWY01', 'JTH02', '0', '0', 'OUJ00', 'LKE00', 'WDF45']
})
# Set up the U label: P rows whose Code ends with '00'
df.loc[df['Code'].str.endswith('00') & (df['Type'] == 'P'), 'Type'] = 'U'
# Number each value within its (Key, Type) group, starting at 1
df['Tcount'] = df.groupby(['Key', 'Type']).cumcount() + 1
df['Type'] = df['Type'] + df['Tcount'].astype('str')
# Pivot the table
pv = df.loc[:, ['Key', 'Type', 'Value']].pivot(index='Key', columns='Type', values='Value')
>>> pv
Type    P1    P2    T1    T2    U1    U2
Key
1     P102  P103  T101  T102  P101   NaN
2     P203   NaN  T201  T202  P201  P202
cdf = df.loc[df['Code'] != '0', ['Key', 'Code']].groupby('Key')['Code'].apply(lambda x: ','.join(x))
>>>cdf
Key
1 ABC00,TWY01,JTH02
2 OUJ00,LKE00,WDF45
Name: Code, dtype: object
>>>pv.join(cdf)
       P1    P2    T1    T2    U1    U2               Code
Key
1    P102  P103  T101  T102  P101  None  ABC00,TWY01,JTH02
2    P203  None  T201  T202  P201  P202  OUJ00,LKE00,WDF45
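The labelling and pivot steps can also be wrapped in one function that leaves the input frame untouched. This is a sketch under the same assumptions as the answer above (the Type column holds only 'T' and 'P', and Code is a string); the function name widen and the helper column Label are illustrative:

```python
import numpy as np
import pandas as pd

def widen(df):
    """Pivot Value into numbered T/U/P columns, one row per Key."""
    out = df.copy()
    # P rows whose Code ends with '00' are relabelled U.
    is_u = (out['Type'] == 'P') & out['Code'].str.endswith('00')
    out['Label'] = np.where(is_u, 'U', out['Type'])
    # Number each value within its (Key, Label) group, starting at 1.
    n = out.groupby(['Key', 'Label']).cumcount() + 1
    out['Label'] = out['Label'] + n.astype(str)
    return out.pivot(index='Key', columns='Label', values='Value')

df = pd.DataFrame({
    'Key': ['1', '1', '1', '1', '1', '2', '2', '2', '2', '2'],
    'Value': ['T101', 'T102', 'P101', 'P102', 'P103',
              'T201', 'T202', 'P201', 'P202', 'P203'],
    'Type': ['T', 'T', 'P', 'P', 'P', 'T', 'T', 'P', 'P', 'P'],
    'Code': ['0', '0', 'ABC00', 'TWY01', 'JTH02',
             '0', '0', 'OUJ00', 'LKE00', 'WDF45']})

wide = widen(df)
print(wide)
```

Working on a copy keeps the original Type column intact, which helps if the raw frame is needed again later.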

Create, and insert into, an Aerospike ordered map from Python

I see documentation for appending to a list in Aerospike, from Python, namely:
key = ('test', 'demo', 1)
rec = {'country': 'India', 'city': ['Pune', 'Delhi']}
client.put(key, rec)
client.list_append(key, 'city', 'Mumbai')
However I don't know how to add elements to a map in Aerospike, from Python, and I also don't know how to define said map as sorted.
Essentially I am trying to model a time series as follows:
ticker1: {intepochtime1: some_number, intepochtime2: some_other_number,...}
ticker2: {intepochtime1: some_number, intepochtime2: some_other_number,...}
........
where the tickers are the record keys, and so are indexed, and the intepochtimes are JS-style integer timestamps that are also indexed by virtue of being stored in ascending or descending order, which makes them easy to range-query. How is this doable from Python?
Here is some sample code to get you started:
Also on github: https://github.com/pygupta/aerospike-discuss/tree/master/stkovrflo_Py_SortedMaps
import aerospike
from aerospike import predicates as p
def print_result(result):
    # Each result is a (key, metadata, record) tuple
    key, metadata, record = result
    print(record)
config = {'hosts': [("localhost", 3000)]}
client = aerospike.client(config).connect()
map_policy = {'map_order': aerospike.MAP_KEY_VALUE_ORDERED}
# Insert the records
key = ("test", "demo", 'km1')
client.map_set_policy(key, "mymap", map_policy)
client.map_put(key, "mymap", '0', 13)
client.map_put(key, "mymap", '1', 3)
client.map_put(key, "mymap", '2', 7)
client.map_put(key, "mymap", '3', 2)
client.map_put(key, "mymap", '4', 12)
client.map_put(key, "mymap", '5', 33)
client.map_put(key, "mymap", '6', 1)
client.map_put(key, "mymap", '7', 12)
client.map_put(key, "mymap", '8', 22)
# Query for sorted value
print("Sorted by values, 2 - 14")
ret_val = client.map_get_by_value_range(key, "mymap", 2, 14, aerospike.MAP_RETURN_VALUE)
print(ret_val)
# Get the first 3 indexes
print("Index 0 - 3")
ret_val2 = client.map_get_by_index_range(key, "mymap", 0, 3, aerospike.MAP_RETURN_VALUE)
print(ret_val2)
pgupta@ubuntu:~/discussRepo/aerospike-discuss/stkovrflo_Py_SortedMaps$ python sortedMapExample.py
Sorted by values, 2 - 14
[2, 3, 7, 12, 12, 13]
Index 0 - 3
[13, 3, 7]
See the Python documentation for Client.
Must be ver 3.8.4+.
Create a map policy:
Define one of the key-ordered or key-value-ordered policies.
See http://www.aerospike.com/apidocs/python/client.html#map-policies for map_order.
Define the map policy first, then put the map-type bin.
See http://www.aerospike.com/apidocs/python/client.html#id1
for map_set_policy(key, bin, map_policy),
then map_put().
Sorted maps are just regular maps with a map_order policy.
The Python 3 memory leak was fixed in client ver 2.0.8.