Create n dataframes in a for loop, each with an extra column containing a specific number

Hi all, I have a dataframe like the one shown in the picture.
I am trying to create 2 different dataframes with the same "hour", "minute", and "value" (and "value.1", respectively) columns, also adding a column containing the number 0 and 1 respectively. I would like to do it in a for loop, as I want to create n dataframes (not just the 2 shown here).
I tried something like this, but it's not working (error: KeyError: "['value.i'] not in index"):
for i in range(1):
    series[i] = df_new[['hour', 'minute', 'value.i']]
    series[i].insert(0, 'number', 'i')
Can you help me? Thanks!

From what I have understood, you want value.i to show value.1 or value.2. Use an f-string so that i is interpreted as a variable:
for i in range(1):
    # f-string, so i is interpolated into the column name
    series[i] = df_new[['hour', 'minute', f'value.{i}']]
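For completeness, a minimal sketch of the full loop, assuming the value columns are actually named value.0 through value.{n-1} (adjust the f-string to your real column names) and that series is a dict; note that the number column should get the integer i, not the string 'i':
n = 2  # however many dataframes you need
series = {}
for i in range(n):
    # hypothetical column naming; adapt f'value.{i}' to your frame
    series[i] = df_new[['hour', 'minute', f'value.{i}']].copy()
    # insert the integer i, not the string 'i'
    series[i].insert(0, 'number', i)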

Related

DataFrame column contains characters that are not numbers - how to convert to integer?

I have the following problem: I have a dataframe that looks like this:
A B
0 1 [5]
1 3 [1]
2 3 [118]
3 5 [34]
Now, I need column B to contain only numbers, otherwise I can't work with the data. I already tried the replace function to simply replace "[]" with "", but that didn't work out.
Is there any other way? Maybe I can convert the whole column to keep only the numbers as integers? That would be even better than just dropping the brackets.
I'm grateful for any help, I've been stuck with this for 2 hours now.
If your B column contains a string, use:
df['B'] = df['B'].str[1:-1].astype(int)
If your B column contains a list of one element, use:
df['B'] = df['B'].str[0]
Update: to handle either case robustly, extract the number with a regex:
df['B'] = df['B'].str.extract(r'\[(.*)\]', expand=False).astype(int)
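A quick check of the regex version on the sample data, as a minimal sketch (assuming B holds strings such as '[5]'):
import pandas as pd

df = pd.DataFrame({'A': [1, 3, 3, 5], 'B': ['[5]', '[1]', '[118]', '[34]']})
df['B'] = df['B'].str.extract(r'\[(.*)\]', expand=False).astype(int)
print(df['B'].dtype)  # int64: column B now holds plain integers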

Pandas series replace value ignoring case but only if exact match

As the title says, I'm looking for a clean way to replace an exact string in a series, ignoring case.
ls = {'CAT':'abc','DOG' : 'def','POT':'ety'}
d = pd.DataFrame({'Data': ['cat','dog','pot','Truncate','HotDog','ShuPot'],'Result':['abc','def','ety','Truncate','HotDog','ShuPot']})
d
In the above code, ls holds the key-value pairs, where the key is the existing value in the dataframe column and the value is the replacement.
The issue is that the service passing the dictionary always holds the keys in upper case, while the dataframe may have the values in lower case.
The expected output is stored in the 'Result' column.
I tried including re.IGNORECASE, which changes the last 2 values.
I also tried the following code, but it is not working as expected: it keeps converting values to upper case in each iteration.
for k, v in ls.items():
    print(k, v)
    d['Data'] = d['Data'].astype(str).str.upper().replace({k: v})
    print(d)
I'd appreciate any help.
Create a mapping series from the given dictionary and transform its index to lower case. Then use Series.map to map the lower-cased Data column to the mapping, and Series.fillna to fill the unmatched entries with their original values:
mappings = pd.Series(ls)
mappings.index = mappings.index.str.lower()
d['Result'] = d['Data'].str.lower().map(mappings).fillna(d['Data'])
# print(d)
Data Result
0 cat abc
1 dog def
2 pot ety
3 Truncate Truncate
4 HotDog HotDog
5 ShuPot ShuPot
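Alternatively, a one-liner sketch using Series.replace with regex patterns: the inline (?i) flag makes the match case-insensitive and the ^...$ anchors keep it exact (this assumes the dictionary keys contain no regex metacharacters):
d['Result'] = d['Data'].replace({f'(?i)^{k}$': v for k, v in ls.items()}, regex=True)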

How to show truncated form of large pandas dataframe after style.apply?

Normally, a relatively long dataframe like
df = pd.DataFrame(np.random.randint(0,10,(100,2)))
df
will display in truncated form in a Jupyter notebook,
with the head, the tail, an ellipsis in between, and the row/column count at the end.
However, after style.apply
def highlight_max(x):
    return ['background-color: yellow' if v == x.max() else '' for v in x]

df.style.apply(highlight_max)
we get all rows displayed.
Is it possible to still display the truncated form of the dataframe after style.apply?
Something simple like this?
def display_df(dataframe, function):
    display(dataframe.head().style.apply(function))
    display(dataframe.tail().style.apply(function))
    print(f'{dataframe.shape[0]} rows x {dataframe.shape[1]} columns')

display_df(df, highlight_max)
Output:
**** EDIT ****
def display_df(dataframe, function):
    display(pd.concat([dataframe.iloc[:5, :],
                       pd.DataFrame(index=['...'], columns=dataframe.columns),
                       dataframe.iloc[-5:, :]]).style.apply(function))
    print(f'{dataframe.shape[0]} rows x {dataframe.shape[1]} columns')

display_df(df, highlight_max)
Output:
The jupyter preview is basically something like this:
def display_df(dataframe):
    display(pd.concat([dataframe.iloc[:5, :],
                       pd.DataFrame(index=['...'], columns=dataframe.columns, data={0: '...', 1: '...'}),
                       dataframe.iloc[-5:, :]]))
but if you try to apply a style to it, you get an error (TypeError: '>=' not supported between instances of 'int' and 'str') because it tries to compare and highlight the string values '...'.
A Styler has no head or tail method, so truncate the dataframe first and then apply the style. This gives you more control over what you display each time (note the highlight is then computed over the displayed rows only):
output = df.head(10).style.apply(highlight_max)  # 10 -> number of rows to display
output
If you want to see more varied data you can also use sample, which takes random rows:
df.sample(10).style.apply(highlight_max)
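As an aside, newer pandas versions can truncate Styler rendering globally; a sketch, assuming your version supports the styler.render.max_elements option (added around pandas 1.4):
import pandas as pd

# render at most 200 cells; larger styled frames get trimmed automatically
pd.set_option("styler.render.max_elements", 200)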

How to split a column into multiple columns and then count the null values in the new column in SQL or Pandas?

I have a relatively large table with thousands of rows and a few tens of columns. Some columns are metadata and others are numerical values. The problem I have is that some metadata columns are incomplete or partial, that is, they are missing the string after a ":". I want a count of how many rows are missing the part after the colon.
If you look at the miniature example below, what I should get is a small table telling me that in group A, MetaData is complete for 2 entries and incomplete (missing the part after ":") for the other 2 entries. Ideally I also want some statistics on SomeValue (count, max, min, etc.).
How do I do this in an SQL query or in Python pandas?
It might turn out to be simple with some built-in function, but I am not getting it right.
Data:
Group MetaData SomeValue
A AB:xxx 20
A AB: 5
A PQ:yyy 30
A PQ: 2
Expected Output result:
Group MetaDataComplete Count
A Yes 2
A No 2
No reason to use split functions (unless the value itself can contain a colon character). I'm just going to assume that the "null" values (not technically the right word) end with :.
select
"Group",
case when MetaData like '%:' then 'Yes' else 'No' end as MetaDataComplete,
count(*) as "Count"
from T
group by "Group", case when MetaData like '%:' then 'Yes' else 'No' end
You could also use right(MetaData, 1) = ':'.
Or supposing that values can contain their own colons, try charindex(':', MetaData) = len(MetaData) if you just want to ask whether the first colon is in the last position.
Here is an example:
In [1]:
import pandas as pd
import numpy as np

# 1- Create dataframe
cols = ['Group', 'MetaData', 'SomeValue']
data = [['A', 'AB:xxx', 20],
        ['A', 'AB:', 5],
        ['A', 'PQ:yyy', 30],
        ['A', 'PQ:', 2]]
df = pd.DataFrame(columns=cols, data=data)

# 2- New dataframe with the split value columns
new = df["MetaData"].str.split(":", n=1, expand=True)
df["MetaData_1"] = new[0]
df["MetaData_2"] = new[1]

# 3- Drop the old MetaData column
df.drop(columns=["MetaData"], inplace=True)

# 4- Replace empty strings by NaN and count them
df.replace('', np.nan, inplace=True)
df.isnull().sum()
Out [1]:
Group 0
SomeValue 0
MetaData_1 0
MetaData_2 2
dtype: int64
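To also get the requested statistics on SomeValue, a minimal sketch building on the frame above (after the split, MetaData_2 is NaN exactly for the incomplete rows):
df['MetaDataComplete'] = np.where(df['MetaData_2'].isnull(), 'No', 'Yes')
print(df.groupby(['Group', 'MetaDataComplete'])['SomeValue'].agg(['count', 'min', 'max']))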
From a SQL perspective, performing a split is painful, not to mention that using the split results means performing the query first and then querying the results:
SELECT
    Results.[Group],
    Results.MetaData,
    Results.MetaValue,
    COUNT(Results.MetaValue)
FROM (SELECT
          [Group],
          MetaData,
          SUBSTRING(MetaData, CHARINDEX(':', MetaData) + 1, LEN(MetaData)) AS MetaValue
      FROM VeryLargeTable) AS Results
GROUP BY Results.[Group],
         Results.MetaData,
         Results.MetaValue
If you're just after a count, you could also try an algorithmic approach: loop over the data and use a regular expression with a negative lookahead.
import re

pattern = '.*:(?!.)'  # matches strings with nothing after the final colon
missing = 0
not_missing = 0
for i in df['MetaData'].tolist():
    match = re.findall(pattern, i)
    if match:
        missing += 1
    else:
        not_missing += 1
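The same count can be done without an explicit loop; a vectorized sketch, assuming df still has the original MetaData column:
incomplete = df['MetaData'].str.endswith(':')
print(incomplete.sum(), 'missing,', (~incomplete).sum(), 'not missing')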

Histogram of a pandas dataframe

I couldn't find a similar question anywhere on the site.
I have a fairly large file with over 100000 lines, which I read using pandas:
df = pd.read_excel("somefile.xls", index_col='Offense Type')
I ended up with a dataframe consisting of the first column (the index column) and another column, 'Offense Type' and 'Hour' respectively.
'Offense Type' consists of a series of "categories", say cat1, cat2, cat3, etc.
'Hour' consists of a series of integers between 1 and 24.
What I would like is a histogram of the occurrences of each number per category (there aren't that many categories, at most 10 of them).
Here's an ASCII representation of what I want to get
(the x's represent the bars in the histogram; they will surely be at a much higher value than 1, 2 or 3):
x x # And so on
x x x x x x #
x x x x x x x #
1 2 11 20 5 8 18 #
Cat1 Cat2 #
But I'm getting a single bar for every line in df using:
df.plot(kind='bar')
which is basically unreadable.
I've also tried the hist() and Histogram() functions with no luck.
Here's some sample data:
After a long night, I found the answer. Since every event occurs only once, I added an extra column to the file containing the number one and indexed the dataframe by it:
df = pd.read_excel("somefile.xls",index_col='Numberone')
And then simply tried this:
df.hist(by=df['Offense Type'])
finally getting exactly what I wanted.
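For reference, a sketch that avoids the dummy index column, assuming the columns are named 'Offense Type' and 'Hour' as described above:
import pandas as pd

df = pd.read_excel("somefile.xls")
# one histogram panel of Hour per offense category
df.hist(column='Hour', by='Offense Type', bins=24)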