Following is the code I am using to generate a list and write it to a text file:
import numpy as np
c = 0
a = []
for i in range(1, 16, 1):
    b = i/10
    c += 1
    a.append([c, b])
np.savetxt('test.txt', a, delimiter=" ", fmt="%s")
When the list a is printed, the values taken by c are integers. However, when the list a is written to the file, c becomes a float. Is it possible to write both integers and floats to a text file using numpy.savetxt?
You can specify the format of each column. In your case, np.array(a) produces a 2D array with two columns:
np.savetxt('your_file.txt',a,delimiter=' ',fmt='%d %f')
Here fmt='%d %f' corresponds to an integer followed by a float.
The .txt file now contains:
1 0.100000
2 0.200000
3 0.300000
4 0.400000
5 0.500000
6 0.600000
7 0.700000
8 0.800000
9 0.900000
10 1.000000
11 1.100000
12 1.200000
13 1.300000
14 1.400000
15 1.500000
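If you want to check the result, here is a minimal sketch (the field names 'c' and 'b' are just illustrative) that reads the file back with a structured dtype and confirms the column types:
import numpy as np

# read the two columns back as an integer field and a float field
data = np.loadtxt('your_file.txt', dtype=[('c', 'i8'), ('b', 'f8')])
print(data['c'].dtype, data['b'].dtype)  # int64 float64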
Can pandas convert these key=value records into a customized table? Here is a sample of the data.
1675484100 customer=A.1 area=1 height=20 width={10,10} length=1
1675484101 customer=B.1 area=10 height=30 width={20,11} length=2
1675484102 customer=C.1 area=11 height=40 width={30,12} length=3 remarks=call
I want to generate a table with each key as a header and its associated value, with the first field as a time.
I would use a regex to get each key/value pair, then reshape:
import pandas as pd

data = '''1675484100 customer=A.1 area=1 height=20 width={10,10} length=1
1675484101 customer=B.1 area=10 height=30 width={20,11} length=2
1675484102 customer=C.1 area=11 height=40 width={30,12} length=3 remarks=call'''
df = (pd.Series(data.splitlines()).radd('time=')
        .str.extractall(r'([^\s=]+)=([^\s=]+)')
        .droplevel('match').set_index(0, append=True)[1]
        # unstack, keeping column order
        .pipe(lambda d: d.unstack()[d.index.get_level_values(-1).unique()])
      )
print(df)
Output:
0 time customer area height width length remarks
0 1675484100 A.1 1 20 {10,10} 1 NaN
1 1675484101 B.1 10 30 {20,11} 2 NaN
2 1675484102 C.1 11 40 {30,12} 3 call
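If you also want the time column as a real datetime (as the other answer below produces), a small follow-up, assuming the first field is unix seconds as in the sample:
# the extracted values are strings, so cast before converting
df['time'] = pd.to_datetime(df['time'].astype('int64'), unit='s')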
Assuming that your input is a string defined as data, you can use this:
L = [{k: v for k, v in (x.split("=") for x in l.split()[1:])}
     for l in data.split("\n") if l.strip()]
df = pd.DataFrame(L)
df.insert(0, "time", [pd.to_datetime(int(x.split()[0]), unit="s")
                      for x in data.split("\n") if x.strip()])
Otherwise, if the data is stored in some sort of text (.txt) file, add this at the beginning:
with open("file.txt", "r") as f:
data = f.read()
Output:
print(df)
time customer area height width length remarks
0 2023-02-04 04:15:00 A.1 1 20 {10,10} 1 NaN
1 2023-02-04 04:15:01 B.1 10 30 {20,11} 2 NaN
2 2023-02-04 04:15:02 C.1 11 40 {30,12} 3 call
Demo CSV file:
label1 label2 m1
0 KeyT1_L1_1_animebook0000_1 KeyT1_L1_1_animebook0000_1 0.000000
1 KeyT1_L1_1_animebook0000_1 KeyT1_L1_1_animebook0001_1 1.000000
2 KeyT1_L1_1_animebook0000_1 KeyT1_L1_1_animebook0002_1 1.000000
3 KeyT1_L1_1_animebook0000_1 KeyT1_L1_1_animebook0003_1 1.414214
4 KeyT1_L1_1_animebook0000_1 KeyT1_L1_1_animebook0004_1 2.000000
5 KeyT1_L1_1_animebook0000_1 KeyT1_L1_1_animebook0005_1 2.000000
6 KeyT1_L1_1_animebook0000_1 KeyT1_L1_1_animebook0006_1 3.000000
7 KeyT1_L1_1_animebook0000_1 KeyT1_L1_1_animebook0007_1 3.162278
8 KeyT1_L1_1_animebook0000_1 KeyT1_L1_1_animebook0008_1 4.000000
9 KeyT1_L1_1_animebook0000_1 KeyT1_L1_1_animebook0009_1 5.000000
10 KeyT1_L1_1_animebook0000_1 KeyT1_L1_1_animebook0010_1 5.000000
11 KeyT1_L1_1_animebook0000_1 KeyT1_L1_1_animebook0011_1 6.000000
12 KeyT1_L1_1_animebook0000_1 KeyT1_L1_1_animebook0012_1 6.000000
13 KeyT1_L1_1_animebook0000_1 KeyT1_L1_1_animebook0013_1 6.000000
14 KeyT1_L1_1_animebook0000_1 KeyT1_L1_1_animebook0014_1 6.000000
From this CSV file, I do some comparison operations. I have a function which makes the comparison and returns the minimum of the combinations.
There are 160,000 rows. Using pandas and a for loop takes a lot of time. Can I make it faster using dask? I tried creating a dask dataframe from the pandas one, but when I use to_list, which works on a pandas column, it gives me an error. I have a Core i7 machine and 128 GB of RAM. Below is my code:
"""
#the purpose of this function is to calculate different rows...
#values for the m1 column of data frame. there could be two
#combinations and inside combination it needs to get m1 value for the row
#suppose first comb1 will calucalte sum of m1 value of #row(KeyT1_L1_1_animebook0000_1,KeyT1_L1_1_animebook0001_1) and
#row(KeyT1_L1_1_animebook0000_1,KeyT1_L1_1_animebook0001_2)
a more details of this function could be found here:
(https://stackoverflow.com/questions/72663618/writing-a-python-function-to-get-desired-value-from-csv/72677299#72677299)
def compute(img1,img2):
comb1=(img1_1,img2_1)+(img1_1,img2_2)
comb2=(img1_2,img2_1)+(img1_2,img2_2)
return minimum(comb1,comb2)
"""
def min_4line(key1, key2, list1, list2, df):
    k = ['1', '2', '3', '4']
    indice_list = []
    key1_line1 = key1 + '_' + k[0]
    key1_line2 = key1 + '_' + k[1]
    key1_line3 = key1 + '_' + k[2]
    key1_line4 = key1 + '_' + k[3]
    key2_line1 = key2 + '_' + k[0]
    key2_line2 = key2 + '_' + k[1]
    key2_line3 = key2 + '_' + k[2]
    key2_line4 = key2 + '_' + k[3]
    ind1 = df.index[(df['label1'] == key1_line1) & (df['label2'] == key2_line1)].tolist()
    ind2 = df.index[(df['label1'] == key1_line2) & (df['label2'] == key2_line2)].tolist()
    ind3 = df.index[(df['label1'] == key1_line3) & (df['label2'] == key2_line3)].tolist()
    ind4 = df.index[(df['label1'] == key1_line4) & (df['label2'] == key2_line4)].tolist()
    comb1 = int(df.loc[ind1, 'm1']) + int(df.loc[ind2, 'm1']) + int(df.loc[ind3, 'm1']) + int(df.loc[ind4, 'm1'])
    ind1 = df.index[(df['label1'] == key1_line2) & (df['label2'] == key2_line1)].tolist()
    ind2 = df.index[(df['label1'] == key1_line3) & (df['label2'] == key2_line2)].tolist()
    ind3 = df.index[(df['label1'] == key1_line4) & (df['label2'] == key2_line3)].tolist()
    ind4 = df.index[(df['label1'] == key1_line1) & (df['label2'] == key2_line4)].tolist()
    comb2 = int(df.loc[ind1, 'm1']) + int(df.loc[ind2, 'm1']) + int(df.loc[ind3, 'm1']) + int(df.loc[ind4, 'm1'])
    return min(comb1, comb2)
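As an aside (a sketch of an alternative, not the code I am running): the eight boolean scans per pair are what make min_4line slow. The same lookups can go through a (label1, label2) MultiIndex built once, assuming each label pair occurs exactly once as in the demo CSV:
# build the lookup once, outside the pair loop (df3 is the dataframe read from the demo CSV)
m1_lookup = df3.set_index(['label1', 'label2'])['m1']

def min_4line_indexed(key1, key2, lookup):
    # same pairing as min_4line: comb1 pairs line i with line i,
    # comb2 pairs line i+1 with line i (wrapping around)
    k = ['1', '2', '3', '4']
    lines1 = [key1 + '_' + s for s in k]
    lines2 = [key2 + '_' + s for s in k]
    comb1 = sum(int(lookup.loc[(lines1[i], lines2[i])]) for i in range(4))
    comb2 = sum(int(lookup.loc[(lines1[(i + 1) % 4], lines2[i])]) for i in range(4))
    return min(comb1, comb2)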
Now, I have to create a unique list of labels to do the comparison:
list_line=list(df3['label1'].unique())
string_test=[a[:-2] for a in list_line]
#above list comprehension is done as we will get unique label like animebook0000_1,animebook0001_1
list_key=sorted(list(set(string_test)))
print(len(list_key))
#making lists of those two columns
lable1_list=df3['label1'].to_list()
lable2_list=df3['label2'].to_list()
Next, I write the output of the comparison function to a CSV file:
%%time
file = open("content\\dummy_metric.csv", "a")
file.write("label1,label2,m1\n")
c = 0
for i in range(len(list_key)):
    for j in range(i + 1, len(list_key)):
        a = min_4line(list_key[i], list_key[j], lable1_list, lable2_list, df3)
        #print(a)
        file.write(str(list_key[i]) + "," + str(list_key[j]) + "," + str(a) + "\n")
        c += 1
        if c > 20000:
            print('20k done')
My expected output:
label1 label2 m1
0 KeyT1_L1_1_animebook0000 KeyT1_L1_1_animebook0001 2
1 KeyT1_L1_1_animebook0000 KeyT1_L1_1_animebook0002 2
2 KeyT1_L1_1_animebook0000 KeyT1_L1_1_animebook0003 2
3 KeyT1_L1_1_animebook0000 KeyT1_L1_1_animebook0004 4
4 KeyT1_L1_1_animebook0000 KeyT1_L1_1_animebook0005 5
5 KeyT1_L1_1_animebook0000 KeyT1_L1_1_animebook0006 7
6 KeyT1_L1_1_animebook0000 KeyT1_L1_1_animebook0007 9
7 KeyT1_L1_1_animebook0000 KeyT1_L1_1_animebook0008 13
For dask I was proceeding like this:
import pandas as pd
import dask.dataframe as dd
csv_gb=pd.read_csv("content\\four_metric.csv")
dda = dd.from_pandas(csv_gb, npartitions=10)
Up to that line it is fine, but when I want to make the list of labels like this:
lable1_list=df3['label1'].to_list()
it's showing me this error:
2022-07-05 16:31:17,530 - distributed.worker - WARNING - Compute Failed
Key: ('unique-combine-5ce843b510d3da88b71287e6839d3aa3', 0, 1)
Function: execute_task
args: ((<function pipe at 0x0000022E39F18160>, [0 KeyT1_L1_1_animebook0000_1
.....
25 KeyT1_L1_1_animebook_002
kwargs: {}
Exception: 'TypeError("\'Serialize\' object is not callable")'
Is there a better way to perform the above code with dask? I am also curious about using the dask distributed Client like this for my task:
from dask.distributed import Client
client = Client()
client = Client(n_workers=3, threads_per_worker=1, processes=False, memory_limit='40GB')
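For reference, a minimal sketch (an assumption on my part, not tested against the setup above) of how the unique-label step is usually written with dask: keep the column lazy and only call .compute() at the end, instead of calling to_list on the dask column:
# dda is the dask dataframe created above; compute() materializes a pandas result
label1_unique = dda['label1'].unique().compute().tolist()
list_key = sorted({a[:-2] for a in label1_unique})
print(len(list_key))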
I have two dataframes extracted from two attached files.
I want to compute the JaroWinkler similarity for tokens inside the files. I am using the code below.
from similarity.jarowinkler import JaroWinkler
jarowinkler = JaroWinkler()
df_gt['jarowinkler_sim'] = [jarowinkler.similarity(x.lower(), y.lower()) for x, y in zip(df_ex['abstract_ex'], df_gt['abstract_gt'])]
I am facing two problems:
1. The order of the tokens is not being handled.
When the positions of the tokens 'can' and 'interesting' are swapped, the similarity index is computed incorrectly!
Unnamed: 0 abstract_gt jarowinkler_sim
0 0 Bipartite 1.000000
1 1 fluctuations 0.914141
2 2 can 0.474747 <--|
3 3 provide 1.000000 |-- Position swapped in one file
4 4 interesting 0.474747 <--|
5 5 information 1.000000
6 6 about 1.000000
7 7 entanglement 1.000000
8 8 properties 1.000000
9 9 and 1.000000
10 10 correlations 1.000000
2. The sizes of the dataframes might not always be the same.
When one of the dataframes contains fewer elements, my solution gives an error:
ValueError: Length of values (10) does not match length of index (11)
How can I solve these two problems and compute the similarity accurately?
Thanks !!
TSV FILES
1. df_ex
abstract_ex
0 Bipartite
1 fluctuations
2 interesting
3 provide
4 can
5 information
6 about
7 entanglement
8 properties
9 and
10 correlations
2. df_gt
abstract_gt
0 Bipartite
1 fluctuations
2 interesting
3 provide
4 can
5 information
6 about
7 entanglement
8 properties
9 and
10 correlations
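One possible direction, only a sketch: it assumes the tokens are meant to be matched by value rather than by row position, and that truncating to the shorter list is acceptable when the lengths differ.
import pandas as pd
from similarity.jarowinkler import JaroWinkler  # same package as in the question

jarowinkler = JaroWinkler()

# sort both token columns so the comparison is by matching token rather than
# by original row position; zip() truncates to the shorter list, so a length
# mismatch no longer raises when building the result
ex_tokens = sorted(df_ex['abstract_ex'].str.lower())
gt_tokens = sorted(df_gt['abstract_gt'].str.lower())

result = pd.DataFrame(
    [(x, y, jarowinkler.similarity(x, y)) for x, y in zip(ex_tokens, gt_tokens)],
    columns=['abstract_ex', 'abstract_gt', 'jarowinkler_sim'],
)
print(result)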
I have the following dataframe.
precision recall F1 cutoff
cutoff
0 0.690148 1.000000 0.814610 0
1 0.727498 1.000000 0.839943 1
2 0.769298 0.916667 0.834051 2
3 0.813232 0.916667 0.859741 3
4 0.838062 0.833333 0.833659 4
5 0.881454 0.833333 0.854946 5
6 0.925455 0.750000 0.827202 6
7 0.961111 0.666667 0.786459 7
8 0.971786 0.500000 0.659684 8
9 0.970000 0.166667 0.284000 9
10 0.955000 0.083333 0.152857 10
I want to plot the cutoff column on the x-axis and the precision, recall and F1 values as separate lines on the same plot (in different colors). How can I do it?
When I try to plot the dataframe, the cutoff column is also included in the plot.
Thanks
Remove the column before plotting:
df.drop('cutoff', axis=1).plot()
But maybe the problem is how the index is created; it might help to change:
df = df.set_index(df['cutoff'])
df.drop('cutoff', axis=1).plot()
to:
df = df.set_index('cutoff')
df.plot()
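Putting the two pieces together, a minimal end-to-end sketch, assuming matplotlib as the plotting backend:
import matplotlib.pyplot as plt

df = df.set_index('cutoff')                  # cutoff becomes the x-axis
df[['precision', 'recall', 'F1']].plot()     # one line per metric, separate colors
plt.show()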
I want to build a correspondence between column label1 and column label2 with a certain rule.
label1 is like an on switch, and label2 is like an off switch. Once label1 is on, further operations on label1 will not re-trigger the switch until it is switched off by label2. Then label1 can switch on again.
For example, I have a following table:
index label1 label2 note
1 F T label2 is invalid because not switch on yet
2 T F label1 switch on
3 F F
4 T F useless action because it's on already
5 F T switch off
6 F F
7 T F switch on
8 F F
9 F T switch off
10 F F
11 F T invalid off operation, not on
The correct output is something like:
label1ix label2ix
2 5
7 9
What I tried is:
df['label2ix'] = df.loc[df.label2 == 'T', 'index']  # find the label2 == 'T' index
df['label2ix'].bfill(inplace=True)                  # backfill the column
mask = (df['label1'] == 'T')                        # label1 == 'T', then get the index and label2ix
newdf = pd.DataFrame(df.loc[mask, ['index', 'label2ix']])
This is not correct because I have got is:
label1ix label2ix note
2 5 correct
4 5 wrong operation
7 9 correct
I am not sure how to filter out row 4.
I have got another idea:
df['label2ix'] = df.loc[df.label2 == 'T', 'index']  # find the label2 == 'T' index
df['label2ix'].bfill(inplace=True)                  # backfill the column
groups = df.groupby('label2ix')
firstlabel1 = groups['label1'].first()
But for this solution, I don't know how to get the first label1 == 'T' in each group.
And I am not sure if there is a more efficient way to do that. Grouping is usually slow.
Not tested yet, but here are a few things you can try:
Option 1: For the first approach, you can filter out the 4 by:
newdf.groupby('label2ix').min()
but this approach might not work with more general data.
Option 2: This might work better in general:
# copy all on and off switches to a common column
# 0 - off, 1 - on
df['state'] = np.select([df.label1=='T', df.label2=='T'], [1,0], default=np.nan)
# ffill will fill the na with the state before it
# until changed by a new switch
df['state'] = df['state'].ffill().fillna(0)
# mark the changes of states
df['change'] = df['state'].diff()
At this point, df will be:
index label1 label2 state change
0 1 F T 0.0 NaN
1 2 T F 1.0 1.0
2 3 F F 1.0 0.0
3 4 T F 1.0 0.0
4 5 F T 0.0 -1.0
5 6 F F 0.0 0.0
6 7 T F 1.0 1.0
7 8 F F 1.0 0.0
8 9 F T 0.0 -1.0
9 10 F F 0.0 0.0
10 11 F T 0.0 0.0
which should be easy to track all the state changes:
switch_ons = df.loc[df['change'].eq(1), 'index']
switch_offs = df.loc[df['change'].eq(-1), 'index']
# return df
new_df = pd.DataFrame({'label1ix': switch_ons.values,
                       'label2ix': switch_offs.values})
and output:
label1ix label2ix
0 2 5
1 7 9
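One note on an assumption in Option 2: np.select above compares against the string 'T'. If label1/label2 hold real booleans instead of 'T'/'F' strings, the equivalent (same df and np as above) would be:
# boolean columns can be passed directly as the conditions
df['state'] = np.select([df.label1, df.label2], [1, 0], default=np.nan)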