Select data from pandas MultiIndex pivot table - pandas

I have a MultiIndex dataframe (pivot table) with 1703 rows that looks like this:
Local code    Ex Code  ...  Value
159605        FR1xx    ...  30
159973        FR1xx    ...  50
...
ZZC923HDV906  XYxx     ...  20
There are either numeric local codes (e.g. 159973) or local codes consisting of both letters and digits (e.g. ZZC923HDV906).
I'd like to select the data by the first index column (Local code). This works well for the alphanumeric codes with the following code:
pv_comb[(pv_comb.index.get_level_values("Local code") == "ZZC923HDV906")]
However, I can't manage to select the numeric values:
pv_comb[(pv_comb.index.get_level_values("Local code") == 159973)]
This returns an empty dataframe.
Is it possible to convert the values in the first level of the MultiIndex into strings and then select the data?

IIUC you need quotes, because your numeric values are stored as strings, so change 159973 to '159973':
pv_comb[(pv_comb.index.get_level_values("Local code") == '159973')]
If you need to convert some level of the MultiIndex to string, create a new index and then assign it:
# if python 3, wrap zip in list
new_index = list(zip(df.index.get_level_values('Local code').astype(str),
                     df.index.get_level_values('Ex Code')))
df.index = pd.MultiIndex.from_tuples(new_index, names=df.index.names)
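A shorter alternative (my sketch, not from the original answer) uses the built-in MultiIndex.set_levels, which rebuilds a single level without touching the others:
# Sketch: cast only the 'Local code' level to string; 'Ex Code' stays as-is
df.index = df.index.set_levels(df.index.levels[0].astype(str), level='Local code')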
It's also possible there is some whitespace; you can remove it with strip:
# change the MultiIndex, stripping whitespace from 'Local code'
new_index = list(zip(df.index.get_level_values('Local code').astype(str).str.strip(),
                     df.index.get_level_values('Ex Code')))
df.index = pd.MultiIndex.from_tuples(new_index, names=df.index.names)
If there are many levels, a solution is to first reset the problematic level, do the operations, and set the index back. Sorting the index may then be necessary:
df = pd.DataFrame({'Local code': [159605, 159973, 'ZZC923HDV906'],
                   'Ex Code': ['FR1xx', 'FR1xx', 'XYxx'],
                   'Value': [30, 50, 20]})
pv_comb = df.set_index(['Local code', 'Ex Code'])
print(pv_comb)

                      Value
Local code   Ex Code
159605       FR1xx       30
159973       FR1xx       50
ZZC923HDV906 XYxx        20
pv_comb = pv_comb.reset_index('Local code')
pv_comb['Local code'] = pv_comb['Local code'].astype(str)
pv_comb = pv_comb.set_index('Local code', append=True).swaplevel(0, 1)
print(pv_comb)

                      Value
Local code   Ex Code
159605       FR1xx       30
159973       FR1xx       50
ZZC923HDV906 XYxx        20

Related

Pandas - Setting column value, based on a function that runs on another column

I have been all over the place trying to get this to work (I'm new to data science). It's obviously because I don't fully get how the data structures in pandas work.
I have this code:
def getSearchedValue(identifier):
    full_str = anedf["Diskret data"].astype(str)
    value = ""
    if full_str.str.find(identifier) <= -1:
        start_index = full_str.str.find(identifier) + len(identifier) + 1
        end_index = full_str[start_index:].find("|") + start_index
        value = full_str[start_index:end_index].astype(str)
    return value

for col in anedf.columns:
    if col.count("#") > 0:
        anedf[col] = getSearchedValue(col)
What I'm trying to do is iterate over my columns. I have around 260 in my dataframe. If a column name contains the character #, the script should try to fill values based on what's in my "Diskret data" column.
Data in the "Diskret data" column is completely messed up, but is in the form CCC#111~VALUE|DDD#222~VALUE| and so on, until there are no more identifier+value pairs. Not all identifiers are present in each row, and they come in no specific order.
The function works if I run it with hard-coded strings in a regular Python script. But with the dataframe I get various errors like:
ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().

Input In [119], in <cell line: 12>()
     12 for col in anedf.columns:
     13     if col.count("#") > 0:
---> 14         anedf[col] = getSearchedValue(col)

Input In [119], in getSearchedValue(identifier)
      4 full_str = anedf["Diskret data"].astype(str)
      5 value=""
----> 6 if full_str.str.find(identifier) <= -1:
      7     start_index = full_str.str.find(identifier)+len(identifier)+1
      8     end_index = full_str[start_index:].find("|")+start_index
I guess this is because it evaluates against all rows (a Series), which obviously produces a mix of True and False. But how can I make the evaluation and assignment work row by row, like this:
Diskret data              CCC#111                                                        JJSDJ#1234
CCC#111~1|BBB#2323~2234   1 (copied from "Diskret data")                                 0
JJSDJ#1234~Heart attack   0 (or skipped, since the row has no value for the identifier)  Heart attack
The plan is to drop the "Diskret data" column when the assignment is done, so I have the data in a more structured form.
--- Update ---
By request, I have included a picture of how I visualize the problem, and what I seemingly can't make it do:
[Image: Problem visualisation]
With regex you could do something like:
import pandas as pd

def map_(list_) -> pd.Series:
    if list_:
        idx, values = zip(*list_)
        return pd.Series(values, idx)
    else:
        return pd.Series(dtype=object)

series = pd.Series(
    ['CCC#111~1|BBB#2323~2234', 'JJSDJ#1234~Heart attack']
)
reg_series = series.str.findall(r'([^~|]+)~([^~|]+)')
reg_series.apply(map_)
Breaking this down:
Create a new series by running a map on each row that turns your long string into a list of tuples.
reg_series = series.str.findall(r'([^~|]+)~([^~|]+)')
reg_series
# output:
# 0    [(CCC#111, 1), (BBB#2323, 2234)]
# 1        [(JJSDJ#1234, Heart attack)]
Then we create a map_ function. This function takes each row of reg_series and unzips it into two tuples: the "keys" and the "values". We then build a Series with the keys as the index and the values as the values.
Edit: we added an if/else statement that checks whether the list is non-empty. If it is empty, we return an empty Series of dtype object.
def map_(list_) -> pd.Series:
    if list_:
        idx, values = zip(*list_)
        return pd.Series(values, idx)
    else:
        return pd.Series(dtype=object)

...

print(idx, values)  # first row
# output:
# ('CCC#111', 'BBB#2323') (1, 2234)
Finally we run apply on the series to create a dataframe that takes the outputs from map_ for each row and zips them together in columnar format.
reg_series.apply(map_)
# output:
#   CCC#111 BBB#2323    JJSDJ#1234
# 0       1     2234           NaN
# 1     NaN      NaN  Heart attack
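A possible alternative sketch (my addition, not part of the original answer) uses the built-in Series.str.extractall plus unstack to get the same wide result, assuming no identifier repeats within a row:

import pandas as pd

series = pd.Series(['CCC#111~1|BBB#2323~2234', 'JJSDJ#1234~Heart attack'])

# extractall returns one row per match, indexed by (row number, match number)
pairs = series.str.extractall(r'(?P<key>[^~|]+)~(?P<value>[^~|]+)')

# Pivot: one column per key, one row per original row
wide = (pairs.set_index('key', append=True)['value']
             .droplevel('match')
             .unstack('key'))
print(wide)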

Getting same value from list in dataframe column using Python

I have a dataframe with 3 columns. Now I have added one more column, in which I am adding unique values using a random function.
I created a list variable, and using a for loop I am adding random strings to that list.
After that, I created another loop in which I extract the values from the list and add them to the column.
But the same value is added to every row each time.
import random
import string

import pandas as pd

df = pd.read_csv("test.csv")
lst = []
for i in range(20):
    randColumn = ''.join(random.choice(string.ascii_uppercase + string.digits)
                         for i in range(20))
    lst.append(randColumn)
for j in lst:
    df['randColumn'] = j
print(df)
# Output
   A  B  C            randColumn
0  1  2  3  WHI11NJBNI8BOTMA9RKA
1  4  5  6  WHI11NJBNI8BOTMA9RKA
Could you please help me fix this? Why does each row get the same value from the list?
Updated to work correctly with any type of column in df.
If I understood your question correctly, you can use the zip method of RDDs to achieve your goal.
import random
import string

from pyspark.sql import SparkSession, Row

# Assumes an active SparkSession; create one if needed
sparkSession = SparkSession.builder.getOrCreate()

lst = []
for i in range(2):
    rand_column = ''.join(random.choice(string.ascii_uppercase + string.digits) for i in range(20))
    # Adding random strings as Rows to the list
    lst.append(Row(random=rand_column))

# Making an RDD from the array of random strings
random_rdd = sparkSession.sparkContext.parallelize(lst)
res = df.rdd.zip(random_rdd).map(lambda rows: Row(**(rows[0].asDict()), **(rows[1].asDict()))).toDF()
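For what it's worth, in plain pandas the root cause is that df['randColumn'] = j assigns the scalar j to the entire column on every pass, so the last list element wins. A minimal sketch of my own (with hypothetical stand-in data) assigning the whole list at once, assuming its length matches the row count:

import random
import string

import pandas as pd

# Hypothetical stand-in for the CSV data
df = pd.DataFrame({'A': [1, 4], 'B': [2, 5], 'C': [3, 6]})

# One random 20-character string per row, assigned in a single step
df['randColumn'] = [''.join(random.choices(string.ascii_uppercase + string.digits, k=20))
                    for _ in range(len(df))]
print(df)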

Pandas series replace value ignoring case but only if exact match

As the title says, I'm looking for a clean solution to replace an exact string in a series, ignoring case.
ls = {'CAT': 'abc', 'DOG': 'def', 'POT': 'ety'}
d = pd.DataFrame({'Data': ['cat', 'dog', 'pot', 'Truncate', 'HotDog', 'ShuPot'],
                  'Result': ['abc', 'def', 'ety', 'Truncate', 'HotDog', 'ShuPot']})
d
In the above code, ls holds the key-value pairs, where each key is an existing value in a dataframe column and each value is the value to replace it with.
The issue in this case is that the service that passes the dictionary always holds the dictionary keys in upper case, whereas the dataframe might have the values in lowercase.
The expected output is stored in the 'Result' column.
I tried including re.ignore = True, which changes the last 2 values, and the following code, but it is not working as expected; it also keeps values converted to upper case from the previous iteration:
for k, v in ls.items():
    print(k, v)
    d['Data'] = d['Data'].astype(str).str.upper().replace({k: v})
    print(d)
I'd appreciate any help.
Create a mapping series from the given dictionary and transform its index to lower case. Then use Series.map to map the (lower-cased) values in the Data column through the mapping, and Series.fillna to restore the original values where there was no match:
mappings = pd.Series(ls)
mappings.index = mappings.index.str.lower()
d['Result'] = d['Data'].str.lower().map(mappings).fillna(d['Data'])
# print(d)
       Data    Result
0       cat       abc
1       dog       def
2       pot       ety
3  Truncate  Truncate
4    HotDog    HotDog
5    ShuPot    ShuPot
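An alternative sketch (my addition, not from the answer) uses Series.replace with anchored, case-insensitive regex patterns, so only exact whole-cell matches are replaced:

import re

# Build one anchored, case-insensitive pattern per dictionary key;
# re.escape guards against keys containing regex metacharacters
patterns = {rf'(?i)^{re.escape(k)}$': v for k, v in ls.items()}
d['Result'] = d['Data'].replace(patterns, regex=True)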

How to split a column into multiple columns and then count the null values in the new column in SQL or Pandas?

I have a relatively large table with thousands of rows and a few tens of columns. Some columns are metadata and others are numerical values. The problem I have is that some metadata entries are incomplete or partial, that is, they are missing the string after the ":". I want to get a count of how many rows are missing the part after the colon.
If you look at the miniature example below, what I should get is a small table telling me that in group A, MetaData is complete for 2 entries and incomplete (missing the part after ":") for the other 2. Ideally I also want some statistics on SomeValue (count, max, min, etc.).
How do I do it in an SQL query or in Python Pandas?
It might turn out to be simple with some built-in function; however, I am not getting it right.
Data:
Group  MetaData  SomeValue
A      AB:xxx    20
A      AB:       5
A      PQ:yyy    30
A      PQ:       2
Expected Output result:
Group  MetaDataComplete  Count
A      Yes               2
A      No                2
There's no reason to use split functions (unless the value can itself contain a colon character). I'm just going to assume that the "null" values (not technically the right word) end with ":":
select
    "Group",
    case when MetaData like '%:' then 'Yes' else 'No' end as MetaDataComplete,
    count(*) as "Count"
from T
group by "Group", case when MetaData like '%:' then 'Yes' else 'No' end
You could also use right(MetaData, 1) = ':'.
Or supposing that values can contain their own colons, try charindex(':', MetaData) = len(MetaData) if you just want to ask whether the first colon is in the last position.
Here is an example:
# 1- Create dataframe
In [1]:
import pandas as pd
import numpy as np

cols = ['Group', 'MetaData', 'SomeValue']
data = [['A', 'AB:xxx', 20],
        ['A', 'AB:', 5],
        ['A', 'PQ:yyy', 30],
        ['A', 'PQ:', 2]]
df = pd.DataFrame(columns=cols, data=data)

# 2- New dataframe with split value columns
new = df["MetaData"].str.split(":", n=1, expand=True)
df["MetaData_1"] = new[0]
df["MetaData_2"] = new[1]

# 3- Drop the old MetaData column
df.drop(columns=["MetaData"], inplace=True)

# 4- Replace empty strings by NaN and count them
df.replace('', np.NaN, inplace=True)
df.isnull().sum()
Out [1]:
Group         0
SomeValue     0
MetaData_1    0
MetaData_2    2
dtype: int64
From a SQL perspective, performing a split is painful, not to mention that using the split results means performing the inner query first and then querying its results:
SELECT
    Results.[Group],
    Results.MetaData,
    Results.MetaValue,
    COUNT(Results.MetaValue)
FROM (SELECT
          [Group],
          MetaData,
          SUBSTRING(MetaData, CHARINDEX(':', MetaData) + 1, LEN(MetaData)) AS MetaValue
      FROM VeryLargeTable) AS Results
GROUP BY Results.[Group],
         Results.MetaData,
         Results.MetaValue
If you're just after a count, you could also try the algorithmic approach: just loop over the data and use a regular expression with a negative lookahead.
import pandas as pd
import re

pattern = '.*:(?!.)'  # matches strings where nothing follows the last colon
missing = 0
not_missing = 0
# 'data' here is the asker's dataframe
for i in data['MetaData'].tolist():
    match = re.findall(pattern, i)
    if match:
        missing += 1
    else:
        not_missing += 1
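To actually produce the expected per-group table (plus basic stats on SomeValue), here is a self-contained sketch of my own, reusing the sample data:

import pandas as pd

df = pd.DataFrame({'Group': ['A', 'A', 'A', 'A'],
                   'MetaData': ['AB:xxx', 'AB:', 'PQ:yyy', 'PQ:'],
                   'SomeValue': [20, 5, 30, 2]})

# Flag rows whose MetaData ends with ':' as incomplete
complete = (df['MetaData'].str.endswith(':')
              .map({True: 'No', False: 'Yes'})
              .rename('MetaDataComplete'))

out = (df.groupby(['Group', complete])['SomeValue']
         .agg(['count', 'min', 'max'])
         .reset_index())
print(out)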

Copying column values between a range of index doesn't work

I am trying to convert values from two dimensions in Excel into one dimension, appending the columns one under the other. However, this script doesn't add the values to a specific row range.
I am using pandas to do that. Excel file is here:
https://drive.google.com/file/d/1dfsfJhLFoGiO8_FG4kmZ87JxT2XFBpvX/view?usp=sharing
import pandas as pd

inpExcelFile = 'C:/sample.xlsx'
gridCells = pd.read_excel(inpExcelFile, sheetname='Sheet1')
Filter = pd.DataFrame()

for i in range(1938, 1940, 1):
    gridCells_filter = gridCells[gridCells['Year'] == i]
    gridCells_filter = gridCells_filter.reset_index(drop=True)
    gridCells_filter.replace(to_replace=",", value=".")
    # BELOW IS COPYING COLUMN
    Filter.at[0:30, 'Filtered ' + str(i)] = gridCells_filter.loc[0:30, 'JAN']
    # AFTER THIS, IT DOESN'T COPY COLUMN VALUES
    Filter.at[31:61, 'Filtered ' + str(i)] = gridCells_filter.loc[0:30, 'FEB']
    Filter.at[62:92, 'Filtered ' + str(i)] = gridCells_filter.loc[0:30, 'MAR']
    Filter.at[93:123, 'Filtered ' + str(i)] = gridCells_filter.loc[:30, 'APR']
    Filter.at[124:154, 'Filtered ' + str(i)] = gridCells_filter.loc[0:30, 'MAY']
    Filter.loc[155:185, 'Filtered ' + str(i)] = gridCells_filter.loc[0:30, 'JUN']
    Filter.at[186:216, 'Filtered ' + str(i)] = gridCells_filter.loc[0:30, 'JUL']
    Filter.at[217:247, 'Filtered ' + str(i)] = gridCells_filter.loc[0:30, 'AUG']
    Filter.at[248:278, 'Filtered ' + str(i)] = gridCells_filter.loc[0:30, 'SEP']
    Filter.at[279:309, 'Filtered ' + str(i)] = gridCells_filter.loc[0:30, 'OCT']
    Filter.at[310:340, 'Filtered ' + str(i)] = gridCells_filter.loc[0:30, 'NOV']
    Filter.at[341:371, 'Filtered ' + str(i)] = gridCells_filter.loc[0:30, 'DEC']

Filter[Filter.Filtered + str(i) != '-----']
The expected result is that all column values end up in one column, in the desired order.
You can use a general solution for all years: reshape with DataFrame.melt, use to_datetime with DataFrame.pop to extract and combine the date parts, then sort with DataFrame.sort_values and remove invalid dates like 30.2.1938 with DataFrame.dropna:
df = pd.read_excel('sample.xlsx', decimal=',')
df = df.melt(['DAY','Year'], value_name='val')
s = df.pop('DAY').astype(str) + df.pop('variable') + df.pop('Year').astype(str)
df['datetime'] = pd.to_datetime(s, format='%d%b%Y', errors='coerce')
df = df.sort_values('datetime').dropna(subset=['datetime'])
     val   datetime
279  ---  1938-01-01
280  ---  1938-01-02
281  ---  1938-01-03
282  ---  1938-01-04
283  ---  1938-01-05
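A side note of my own on why the original .at/.loc assignments left gaps: pandas aligns an assigned Series on its index labels, not on position, so values labelled 0..30 cannot land on target rows 31..61. A minimal sketch demonstrating the effect and the usual fix of stripping the index:

import pandas as pd

src = pd.Series([10, 20, 30])                      # index labels 0..2
dst = pd.DataFrame(index=range(6), columns=['x'])

dst.loc[3:5, 'x'] = src              # all NaN: labels 3..5 don't exist in src
dst.loc[3:5, 'x'] = src.to_numpy()   # works: positional copy, no alignment
print(dst)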