tfidf from word counts - pandas

I have a categorical variable with large cardinality (1000+ distinct values). Each of these values can occur repeatedly in each train/test instance.
Although this is not really text data it seems to have similar properties and I would like to treat this as a text classification problem.
My starting point is a dataframe listing the number of occurrences of each "word" in each "document", e.g.
{'Word1': {0: '1',
           1: '3',
           2: '0',
           3: '0',
           4: '0'},
 'Word2': {0: '0',
           1: '2',
           2: '0',
           3: '0',
           4: '0'}}
I would like to apply a tf-idf transformation to these "word" counts. How can I do that?
sklearn.feature_extraction.text.TfidfVectorizer seems to expect a sequence of strings or a file as an input which it preprocesses and tokenizes. None of this is necessary in this case as I already have the "word" counts.
So how to get the tfidf transformation of these counts?

I had a similar situation where I was trying to recreate TF-IDF from word counts. Try the code below; it worked for me.
from collections import Counter

import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = ["The dog ate a sandwich and I ate a sandwich",
          "The wizard transfigured a sandwich"]

# Reference TF-IDF matrix from TfidfVectorizer
vectorizer = TfidfVectorizer(stop_words='english')
tfidfs = vectorizer.fit_transform(corpus)

columns = [k for (v, k) in sorted((v, k)
                                  for k, v in vectorizer.vocabulary_.items())]
tfidfs = pd.DataFrame(tfidfs.todense(), columns=columns)
#    ate   dog  sandwich  transfigured  wizard
# 0  0.75  0.38      0.54          0.00    0.00
# 1  0.00  0.00      0.45          0.63    0.63

# Reciprocal of sklearn's idf_ values
df = 1 / pd.DataFrame([vectorizer.idf_], columns=columns)
#    ate   dog  sandwich  transfigured  wizard
# 0  0.71  0.71       1.0          0.71    0.71

# Raw term counts per document
corp = [txt.lower().split() for txt in corpus]
corp = [[w for w in d if w in vectorizer.vocabulary_] for d in corp]
tfs = pd.DataFrame([Counter(d) for d in corp]).fillna(0).astype(int)
#    ate  dog  sandwich  transfigured  wizard
# 0    2    1         2             0       0
# 1    0    0         1             1       1

# The first document's TF-IDF vector (tf * idf, then L2-normalized):
tfidf0 = tfs.iloc[0] * (1. / df)
tfidf0 = tfidf0 / np.linalg.norm(tfidf0)
#         ate       dog  sandwich  transfigured  wizard
# 0  0.754584  0.377292  0.536893           0.0     0.0

tfidf1 = tfs.iloc[1] * (1. / df)
tfidf1 = tfidf1 / np.linalg.norm(tfidf1)
#    ate  dog  sandwich  transfigured    wizard
# 0  0.0  0.0  0.449436      0.631667  0.631667
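Since the question already has the counts, note that sklearn's TfidfTransformer can be applied to a document-by-term count matrix directly, with no tokenization at all. A minimal sketch, assuming the counts DataFrame from the question (values cast to integers; the counts variable name is just illustrative):
import pandas as pd
from sklearn.feature_extraction.text import TfidfTransformer

# Counts as in the question: one row per "document", one column per "word"
counts = pd.DataFrame({'Word1': [1, 3, 0, 0, 0],
                       'Word2': [0, 2, 0, 0, 0]})

# TfidfTransformer expects a (n_documents, n_terms) count matrix
transformer = TfidfTransformer()
tfidf = transformer.fit_transform(counts.values)

# Back to a labelled DataFrame
tfidf_df = pd.DataFrame(tfidf.toarray(), columns=counts.columns, index=counts.index)
print(tfidf_df)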

Related

CSV amino acid composition, column-wise

Sample file:
Column 10: A|Y|E|A
Column 11: W|I|Q|Q
How do I calculate the amino acid composition (as a percentage) for each column?
For example: the composition of A in column 10 is 50%, E is 25%, and Y is 25%.
Biopython provides modules to calculate the amino acid composition of an entire file in FASTA format, e.g.:
from Bio import SeqIO
from Bio.SeqUtils.ProtParam import ProteinAnalysis

for record in SeqIO.parse('output_translation3.fasta', 'fasta'):
    X = ProteinAnalysis(str(record.seq))
    print('\n Results for record: {}'.format(record.id))
    print(X.count_amino_acids()['G'])
    print(X.count_amino_acids()['A'])
    print(X.count_amino_acids()['L'])
    print(X.count_amino_acids()['M'])
from collections import Counter
import re

with open("input.txt") as f:
    for line in f:
        line = line.strip()
        col, sep, seq = re.split(r'(: )', line)
        aa = re.split(r'[|]', seq)
        aa_counts = Counter(aa)
        aa_length = len(aa)
        print(col)
        for k, v in aa_counts.items():
            print(" ", k, v / aa_length)
Gives:
Column 10
A 0.5
Y 0.25
E 0.25
Column 11
W 0.25
I 0.25
Q 0.5
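If you prefer pandas, a small sketch of the same idea (assuming the file keeps the "Column N: A|Y|E|A" layout shown above) uses value_counts(normalize=True) to get the proportions, multiplied by 100 for percentages:
import pandas as pd

# Parse "Column N: A|Y|E|A" lines into (column label, sequence) pairs
with open("input.txt") as f:
    rows = [line.strip().split(": ") for line in f if line.strip()]

for col, seq in rows:
    residues = pd.Series(seq.split("|"))
    print(col)
    # value_counts(normalize=True) gives the fraction of each residue
    print(residues.value_counts(normalize=True) * 100)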

Random Choice loop through groups of samples

I have a df containing the columns "Income_Groups", "Rate", and "Probability". I need to randomly select a rate for each income group. How can I write a loop and print out the result for each income bin?
The pandas DataFrame looks like this:
import pandas as pd
df={'Income_Groups':['1','1','1','2','2','2','3','3','3'],
'Rate':[1.23,1.25,1.56, 2.11,2.32, 2.36,3.12,3.45,3.55],
'Probability':[0.25, 0.50, 0.25,0.50,0.25,0.25,0.10,0.70,0.20]}
df2=pd.DataFrame(data=df)
df2
Shooting in the dark here, but you can use np.random.choice:
import numpy as np

(df2.groupby('Income_Groups')
    .apply(lambda x: np.random.choice(x['Rate'], p=x['Probability']))
)
Output (can vary due to randomness):
Income_Groups
1 1.25
2 2.36
3 3.45
dtype: float64
You can also pass size into np.random.choice:
(df2.groupby('Income_Groups')
.apply(lambda x: np.random.choice(x['Rate'], size=3, p=x['Probability']))
)
Output:
Income_Groups
1 [1.23, 1.25, 1.25]
2 [2.36, 2.11, 2.11]
3 [3.12, 3.12, 3.45]
dtype: object
Use GroupBy.apply here because the weights differ per group:
import numpy as np
(df2.groupby('Income_Groups')
.apply(lambda gp: np.random.choice(a=gp.Rate, p=gp.Probability, size=1)[0]))
#Income_Groups
#1 1.23
#2 2.11
#3 3.45
#dtype: float64
Another (admittedly silly) way, since your weights seem to have a precision of 2 decimal places:
s = df2.set_index(['Income_Groups', 'Probability']).Rate
(s.repeat((s.index.get_level_values('Probability') * 100).astype(int))  # Weight
 .sample(frac=1)                              # Shuffle |
 .reset_index()                               #         + | -> Random Select
 .drop_duplicates(subset=['Income_Groups'])   # Select  |
 .drop(columns='Probability'))
# Income_Groups Rate
#0 2 2.32
#1 1 1.25
#3 3 3.45
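A more direct option (not in the answers above, just a sketch) is DataFrame.sample, which accepts per-row weights, so each group can draw one row using its own probabilities:
import pandas as pd

# One weighted draw per income group; set random_state for reproducibility
picks = (df2.groupby('Income_Groups', group_keys=False)
            .apply(lambda g: g.sample(n=1, weights=g['Probability'], random_state=0)))
print(picks[['Income_Groups', 'Rate']])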

Finding shortest values between the cities in a dataframe

I have a dataframe of cities and the distances between them. My dataset looks like this:
From City  City A  City B  City C  City D
City A             2166    577     175
City B     2166            1806    2092
City C     577     1806            653
City D     175     2092    653
I'm planning to visit all the cities, and I am trying to find the order in which I can travel with the shortest total distance. The route should end at the starting position, i.e. the start point and end point should be the same.
Is there a way to find this shortest route across all the cities, or is another approach available? Please help.
Late answer; I've been facing the same problem. Leaving a solution here (in case someone needs it) that uses tsp_solver to solve the TSP, and networkx plus pygraphviz to plot the resulting graph.
import numpy as np
import pandas as pd
from tsp_solver.greedy import solve_tsp
from tsp_solver.util import path_cost
import matplotlib.pyplot as plt
import seaborn as sns
import networkx as nx
# for Jupyter Notebook
from IPython.display import Image
Define the distance matrix DataFrame
# Define distances matrix dataframe
df = pd.DataFrame({
'A': [np.nan, 2166, 577, 175],
'B': [2166, np.nan, 1806, 2092],
'C': [577, 1806, np.nan, 653],
'D': [175, 2092, 653, np.nan]
}, index=['A', 'B', 'C', 'D'])
print(df)
A B C D
A NaN 2166.0 577.0 175.0
B 2166.0 NaN 1806.0 2092.0
C 577.0 1806.0 NaN 653.0
D 175.0 2092.0 653.0 NaN
Fill NaNs
# Fill NaNs with 0s
df.fillna(0, inplace=True)
# plot the matrix
sns.heatmap(df, annot=True, fmt='.0f', cmap="YlGnBu")
plt.show()
Take the lower (nilpotent) triangular part of the square, symmetric distance matrix
# Take the lower nilpotent triangular matrix
lower_nilpotent_triangular_df = pd.DataFrame(
np.tril(df),
columns=df.columns,
index=df.index
)
print(lower_nilpotent_triangular_df)
A B C D
A 0.0 0.0 0.0 0.0
B 2166.0 0.0 0.0 0.0
C 577.0 1806.0 0.0 0.0
D 175.0 2092.0 653.0 0.0
# mask
mask = np.zeros_like(lower_nilpotent_triangular_df)
mask[np.triu_indices_from(mask)] = True
# plot the matrix
sns.heatmap(lower_nilpotent_triangular_df,
annot=True, fmt='.0f',
cmap="YlGnBu", mask=mask)
plt.show()
Solve the circular Traveling Salesman Problem
# Solve the circular shortest path
# NOTE: since it is circular, endpoints=(0,0)
# is equal to endpoints=(1,1) etc...
path = solve_tsp(lower_nilpotent_triangular_df.to_numpy(), endpoints=(0, 0))
path_len = path_cost(lower_nilpotent_triangular_df.to_numpy(), path)
# Take path labels from df
path_labels = df.columns[path].to_numpy()
print('shortest circular path:', path_labels)
print('path length:', path_len)
shortest circular path: ['A' 'D' 'B' 'C' 'A']
path length: 4650.0
Plot the graph with the shortest path
# Define graph edges widths
shortest_path_widths = df.copy(deep=True)
shortest_path_widths.loc[:,:] = .25
for idx0, idx1 in zip(path_labels[:-1], path_labels[1:]):
    shortest_path_widths.loc[idx0, idx1] = 4.
    shortest_path_widths.loc[idx1, idx0] = 4.
# Show the graph
G = nx.DiGraph()
for r in lower_nilpotent_triangular_df.columns:
    for c in lower_nilpotent_triangular_df.index:
        if not lower_nilpotent_triangular_df.loc[r, c]:
            continue
        G.add_edge(
            r, c,
            # scaled edge length
            length=lower_nilpotent_triangular_df.loc[r, c] / 250,
            # edge label
            label=int(lower_nilpotent_triangular_df.loc[r, c]),
            # no direction
            dir='none',
            # edge width
            penwidth=shortest_path_widths.loc[r, c]
        )
# Add attributes
for u, v, d in G.edges(data=True):
    d['label'] = d.get('label', '')
    d['len'] = d.get('length', '')
    d['penwidth'] = d.get('penwidth', '')

A = nx.nx_agraph.to_agraph(G)
A.node_attr.update(color="skyblue", style="filled",
                   shape='circle', height=.4,
                   fixedsize=True)
A.edge_attr.update(color="black", fontsize=10)
A.draw('cities.png', format='png', prog='neato')
# Show the image in a Jupyter Notebook
Image('cities.png')
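With only four cities, a brute-force check over all orderings is cheap and can confirm the tsp_solver result. A sketch (not part of the original answer), using the symmetric distance DataFrame df defined above:
from itertools import permutations

# Fix city A as the start/end and try every ordering of the remaining cities
cities = list(df.columns)
best_route, best_len = None, float('inf')
for perm in permutations(cities[1:]):
    route = [cities[0], *perm, cities[0]]
    length = sum(df.loc[a, b] for a, b in zip(route[:-1], route[1:]))
    if length < best_len:
        best_route, best_len = route, length

print(best_route, best_len)
# ['A', 'C', 'B', 'D', 'A'] 4650.0  (the reverse of the tsp_solver path, same total length)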

how to use xarray like pandas panel when adding new items

I have converted a pandas Panel to xarray, but I cannot add new items or extend the major/minor axes as easily as I can with a pandas Panel. The code is below:
import numpy as np
import pandas as pd
import xarray as xr
panel = pd.Panel(np.random.randn(3, 4, 5), items=['one', 'two', 'three'],
major_axis=pd.date_range('1/1/2000', periods=4),
minor_axis=['a', 'b', 'c', 'd','e'])
If I want to add a new item, for example, I can:
panel.four=pd.DataFrame(np.ones((4,5)),index=pd.date_range('1/1/2000', periods=4), columns=['a', 'b', 'c', 'd','e'])
panel.four
a b c d e
2000-01-01 1.0 1.0 1.0 1.0 1.0
2000-01-02 1.0 1.0 1.0 1.0 1.0
2000-01-03 1.0 1.0 1.0 1.0 1.0
2000-01-04 1.0 1.0 1.0 1.0 1.0
I have difficulty increasing the items and the major/minor axes in xarray:
px=panel.to_xarray()
#px gives me
<xarray.DataArray (items: 3, major_axis: 5, minor_axis: 4)>
array([[[-0.440081, -0.888226, 0.158702, 2.107577],
[ 0.917835, -0.174557, 0.501626, 0.116761],
[ 0.406988, 1.95184 , -1.345948, 2.960774],
[-1.905529, 0.25793 , 0.076162, 1.954012],
[ 0.499675, 1.87567 , -1.698771, -1.143766]],
[[ 0.070269, -1.151737, -0.344155, -0.506383],
[-2.199357, -0.040909, 0.491984, -0.333431],
[-0.113155, -0.668475, 2.366683, -0.421863],
[-0.567336, -0.302224, 1.638386, -0.038545],
[ 0.55067 , -0.409266, -0.27916 , -0.942144]],
[[ 1.269171, -0.151471, -0.664072, 0.269168],
[-0.486492, 0.59632 , -0.191977, 0.22537 ],
[ 0.069231, -0.345793, -0.450797, -2.982 ],
[-0.42338 , -0.849736, 0.965738, -0.544596],
[-1.455378, -0.256441, -1.204572, -0.347749]]])
Coordinates:
* items (items) object 'one' 'two' 'three'
* major_axis (major_axis) datetime64[ns] 2000-01-01 2000-01-02 2000-01-03 ...
* minor_axis (minor_axis) object 'a' 'b' 'c' 'd'
#how should I add a fourth item, increase/delete major axis, minor axis?
xarray assignments are not as elegant as with the pandas Panel. Let's say we want to add a fourth item to the data array above. Here is how it works:
four=xr.DataArray(np.ones((1,4,5)), coords=[['four'],pd.date_range('1/1/2000', periods=4),['a', 'b', 'c', 'd','e']],
dims=['items','major_axis','minor_axis'])
pxc=xr.concat([px,four],dim='items')
Whether the operation is on items or on the major/minor axis, the same logic applies (a sketch for the major_axis case follows below). For deleting, use
pxc.drop(['four'], dim='items')
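For example, extending the major_axis by one date follows the same pattern. A sketch, assuming px has the panel's original shape (3 items x 4 dates x 5 minor labels) and using an illustrative new date:
# Build a (3 x 1 x 5) slice for the new date and concatenate along major_axis
new_date = xr.DataArray(np.zeros((3, 1, 5)),
                        coords=[['one', 'two', 'three'],
                                pd.DatetimeIndex(['2000-01-05']),
                                ['a', 'b', 'c', 'd', 'e']],
                        dims=['items', 'major_axis', 'minor_axis'])
px_extended = xr.concat([px, new_date], dim='major_axis')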
xarray.DataArray is based on a single NumPy array internally, so it cannot be efficiently resized or appended to. Your best option is to make a new, larger DataArray with xarray.concat.
The data structure you're probably looking for, if you want to add items the way you would with a pd.Panel, is xarray.Dataset. These are easiest to construct from the multi-indexed DataFrame equivalent to a Panel:
# First, make a DataFrame with a MultiIndex
>>> df = panel.to_frame()
>>> df.head()
one two three
major minor
2000-01-01 a 0.278958 0.676034 -1.544726
b -0.918150 -2.707339 -0.552987
c 0.023479 0.175528 -0.817556
d 1.798001 -0.142016 1.390834
e 0.256575 0.265369 -1.829766
# Now, convert the DataFrame with a MultiIndex to xarray
>>> ds = df.to_xarray()
>>> ds
<xarray.Dataset>
Dimensions: (major: 4, minor: 5)
Coordinates:
* major (major) datetime64[ns] 2000-01-01 2000-01-02 2000-01-03 2000-01-04
* minor (minor) object 'a' 'b' 'c' 'd' 'e'
Data variables:
one (major, minor) float64 0.279 -0.9182 0.02348 1.798 0.2566 2.41 ...
two (major, minor) float64 0.676 -2.707 0.1755 -0.142 0.2654 ...
three (major, minor) float64 -1.545 -0.553 -0.8176 1.391 -1.83 ...
# You can assign a DataFrame if it has the right column/index names
>>> ds['four'] = pd.DataFrame(np.ones((4,5)),
... index=pd.date_range('1/1/2000', periods=4, name='major'),
... columns=pd.Index(['a', 'b', 'c', 'd', 'e'], name='minor'))
# or just pass a tuple directly:
>>> ds['five'] = (('major', 'minor'), np.zeros((4, 5)))
>>> ds
<xarray.Dataset>
Dimensions: (major: 4, minor: 5)
Coordinates:
* major (major) datetime64[ns] 2000-01-01 2000-01-02 2000-01-03 2000-01-04
* minor (minor) object 'a' 'b' 'c' 'd' 'e'
Data variables:
one (major, minor) float64 0.279 -0.9182 0.02348 1.798 0.2566 2.41 ...
two (major, minor) float64 0.676 -2.707 0.1755 -0.142 0.2654 ...
three (major, minor) float64 -1.545 -0.553 -0.8176 1.391 -1.83 ...
four (major, minor) float64 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 ...
five (major, minor) float64 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ...
For more on transitioning from pandas.Panel to xarray, read this section in the xarray docs:
http://xarray.pydata.org/en/stable/pandas.html#transitioning-from-pandas-panel-to-xarray
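Deleting a data variable or resizing a dimension on the Dataset works along the same lines; a short sketch using the ds built above (drop_vars, reindex and sel are standard xarray methods; older xarray versions use drop instead of drop_vars, and the new labels here are illustrative):
# Drop a data variable (the Dataset analogue of deleting a Panel item)
ds = ds.drop_vars('five')

# Extend a dimension by reindexing; new labels are filled with NaN
ds_bigger = ds.reindex(major=pd.date_range('1/1/2000', periods=6))

# Shrink the minor axis by selecting labels
ds_smaller = ds.sel(minor=['a', 'b', 'c'])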

Fill missing Values by a ratio of other values in Pandas

I have a column in a Dataframe in Pandas with around 78% missing values.
The remaining 22% values are divided between three labels - SC, ST, GEN with the following ratios.
SC - 16%
ST - 8%
GEN - 76%
I need to replace the missing values with the above three labels so that the ratio of all the elements remains the same as above. The assignment can be random as long as the ratio stays as above.
How do I accomplish this?
Starting with this DataFrame (only to create something similar to yours):
import numpy as np
import pandas as pd

df = pd.DataFrame({'C1': np.random.choice(['SC', 'ST', 'GEN'],
                                           p=[0.16, 0.08, 0.76],
                                           size=1000)})
df.loc[df.sample(frac=0.22).index] = np.nan
It yields a column with 22% NaN and the remaining proportions are similar to yours:
df['C1'].value_counts(normalize=True, dropna=False)
Out:
GEN 0.583
NaN 0.220
SC 0.132
ST 0.065
Name: C1, dtype: float64
df['C1'].value_counts(normalize=True)
Out:
GEN 0.747436
SC 0.169231
ST 0.083333
Name: C1, dtype: float64
Now you can use fillna with np.random.choice:
df['C1'] = df['C1'].fillna(pd.Series(np.random.choice(['SC', 'ST', 'GEN'],
p=[0.16, 0.08, 0.76], size=len(df))))
The resulting column will have these proportions:
df['C1'].value_counts(normalize=True, dropna=False)
Out:
GEN 0.748
SC 0.165
ST 0.087
Name: C1, dtype: float64
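Note that np.random.choice only matches the target ratios in expectation. If the proportions must hold (almost) exactly among the filled values, one alternative (a sketch, not part of the answer above) is to build a label array with fixed counts, shuffle it, and assign it to the missing positions:
import numpy as np

# Positions that still need filling
na_idx = df.index[df['C1'].isna()]
n_missing = len(na_idx)

# Labels in (almost) exact target proportions
labels = ['SC', 'ST', 'GEN']
ratios = [0.16, 0.08, 0.76]
counts = [int(round(r * n_missing)) for r in ratios]
fill = np.repeat(labels, counts)
# Pad or trim in case rounding left the array slightly off in length
fill = np.resize(fill, n_missing)

np.random.shuffle(fill)
df.loc[na_idx, 'C1'] = fill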