Datatables comparison on 1 or 2 columns - vb.net

I have 2 DataTables with different rows; one of the columns is common to both.
By comparing only that common column, I want to find the values that appear in both tables (result1) and the values that appear in table2 but not in table1 (result2).
EDIT: added table structure
Table1

col1  col2  col3  col4
1     z     x     y
2     a     s     d
3     3     2     1
4     !     #     4

Table2

col1  col2
1     q
2     w
3     e
4     t
5     %
6     y

Result1

col1
1
2
3
4

Result2

col1
5
6
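For what it's worth, a minimal LINQ sketch of that comparison, assuming both tables expose the shared key as an Integer column named "col1" (adjust the type and name to your schema; AsEnumerable() needs a reference to System.Data.DataSetExtensions):

Imports System.Data
Imports System.Linq

' project the shared key column out of each table
Dim keys1 = table1.AsEnumerable().Select(Function(r) r.Field(Of Integer)("col1"))
Dim keys2 = table2.AsEnumerable().Select(Function(r) r.Field(Of Integer)("col1"))

' Result1: values present in both tables
Dim result1 = keys1.Intersect(keys2).ToList()

' Result2: values present in table2 but not in table1
Dim result2 = keys2.Except(keys1).ToList()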

Related

Pandas shift logic

I have a dataframe like:
col1  customer
1     a
3     a
1     b
2     b
3     b
5     b
I want the logic to be like this:
col1  customer  col2
1     a         1
3     a         1
1     b         1
2     b         2
3     b         3
5     b         3
As you can see, if the customer has consistent (consecutively increasing) values in col1, the counter follows along; if not, col2 keeps the last consistent number, which is 3 here.
I tried using df.shift() but got stuck.
Further Example:
col1
1
1
1
3
5
8
10
this customer should be given a value of 1, because that's the last consistent value for them!
Update
If you have more than one customer, you can use this version:
import numpy as np

# keep col1 where it increments the previous value by 1,
# otherwise carry the previous value forward
inc_count = lambda x: np.where(x.diff(1) == 1, x, x.shift(fill_value=x.iloc[0]))
df['col2'] = df.groupby('customer')['col1'].transform(inc_count)
print(df)
# Output
   col1 customer  col2
0     1        a     1
1     3        a     1
2     1        b     1
3     2        b     2
4     3        b     3
5     5        b     3
Maybe you want to increment a counter whenever a row's value directly follows the one before it:
# Same as df['col1'].diff().eq(1).cumsum().add(1)
df['col2'] = df['col1'].eq(df['col1'].shift() + 1).cumsum().add(1)
print(df)
# Output
   col1 customer  col2
0     1        a     1
1     3        a     1
2     1        b     1
3     2        b     2
4     3        b     3
5     5        b     3
Or, resetting the counter for each customer (the result happens to be the same for this data):
inc_count = lambda x: x.eq(x.shift() + 1).cumsum().add(1)
df['col2'] = df.groupby('customer')['col1'].transform(inc_count)
print(df)
# Output
   col1 customer  col2
0     1        a     1
1     3        a     1
2     1        b     1
3     2        b     2
4     3        b     3
5     5        b     3
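Note that the grouped and ungrouped counters only coincide on this data. A quick illustration with made-up values of where they diverge, namely when one customer's last value is exactly one below the next customer's first:

import pandas as pd

df2 = pd.DataFrame({'col1': [1, 2, 3, 4], 'customer': ['a', 'a', 'b', 'b']})

inc_count = lambda x: x.eq(x.shift() + 1).cumsum().add(1)

# ungrouped: the counter keeps running across the a/b boundary
print(df2['col1'].eq(df2['col1'].shift() + 1).cumsum().add(1).tolist())  # [1, 2, 3, 4]

# grouped: the counter restarts for each customer
print(df2.groupby('customer')['col1'].transform(inc_count).tolist())  # [1, 2, 1, 2]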

Pandas read csv with repeating header rows

I have a csv file where the data is as follows:
    Col1  Col2  Col3
v1     5     9     5
v2     6    10     6
    Col1  Col2  Col3
x1     2     4     6
x2     1     2    10
x3    10     2     1
    Col1  Col2  Col3
y1     9     2     7
i.e. there are 3 different tables with the same headers stacked on top of each other. I am trying to Pythonically get rid of the repeating header rows and get the following result:
    Col1  Col2  Col3
v1     5     9     5
v2     6    10     6
x1     2     4     6
x2     1     2    10
x3    10     2     1
y1     9     2     7
I am not sure how to proceed.
You can read the data and remove the rows whose values are identical to the column names:
import pandas as pd

df = pd.read_csv('file.csv')
# keep only the rows where at least one value differs from its column name
df = df[df.ne(df.columns).any(axis=1)]
Output:
    Col1  Col2  Col3
v1     5     9     5
v2     6    10     6
x1     2     4     6
x2     1     2    10
x3    10     2     1
y1     9     2     7
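One caveat with this approach: because the repeated header strings were read in as data, the numeric columns come back as object dtype, so you may want to cast afterwards, e.g.:

# the stray header rows forced object dtype; restore the numeric types
df = df.astype({'Col1': int, 'Col2': int, 'Col3': int})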
An alternative solution is to detect the repeated header rows first, and then use the skiprows=... argument of read_csv().
This has the downside of reading the file twice, but the advantage that read_csv() can automatically infer the correct datatypes, so you won't have to cast them afterwards using astype().
This example uses a hard-coded column name for the first column, but a more advanced version could determine the header from the first row and then detect repeats of it (see the sketch after the output below).
import pandas as pd

# read the file once to detect the repeated header rows
header_rows = []
header_start = "Col1"
with open('file.csv') as f:
    for i, line in enumerate(f):
        if line.startswith(header_start):
            header_rows.append(i)

# the first (real) header row should always be detected
assert header_rows[0] == 0

# skip all header rows except for the first one (the real one)
df = pd.read_csv('file.csv', skiprows=header_rows[1:])
Output:
    Col1  Col2  Col3
v1     5     9     5
v2     6    10     6
x1     2     4     6
x2     1     2    10
x3    10     2     1
y1     9     2     7
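A sketch of that more advanced variant, which takes the header from the first line instead of hard-coding it:

import pandas as pd

# grab the real header from the first line, then locate later repeats of it
with open('file.csv') as f:
    lines = f.read().splitlines()

header = lines[0]
repeats = [i for i, line in enumerate(lines) if i > 0 and line == header]

df = pd.read_csv('file.csv', skiprows=repeats)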

Retrieve the rows in dataframe based on ascending order of a particular column value

I have a dataframe such as:
d = {'col1': ['A','A','B','B','C','C','C','D','D'], 'col2': [1,3,2,3,1,2,3,3,3]}
df = pd.DataFrame(d)
df
  col1  col2
0    A     1
1    A     3
2    B     2
3    B     3
4    C     1
5    C     2
6    C     3
7    D     3
8    D     3
I need to convert the above dataframe values to the following:
  col1  col2
0    A     1
1    B     2
2    C     1
3    D     3
The context is that I need to check col2 values in ascending order and extract the entire matching row, updating the dataframe such that once a row has been retrieved, it is ignored for the next col2 value.
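The expected output matches simply taking each col1 group's smallest col2. A minimal sketch, assuming that reading of the rule is correct:

import pandas as pd

df = pd.DataFrame({'col1': ['A','A','B','B','C','C','C','D','D'],
                   'col2': [1, 3, 2, 3, 1, 2, 3, 3, 3]})

# scan rows in ascending col2 order and keep the first row seen per col1 group
result = (df.sort_values('col2')
            .drop_duplicates('col1')
            .sort_values('col1')
            .reset_index(drop=True))
print(result)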

Updating pandas dataframe values assigns NaN

I have a dataframe with 3 columns: Col1, Col2 and Col3.
Toy example
d = {'Col1': ['hello', 'k', 'hello', 'we', 'r'],
     'Col2': [10, 20, 30, 40, 50],
     'Col3': [1, 2, 3, 4, 5]}
df = pd.DataFrame(d)
Which gets:
    Col1  Col2  Col3
0  hello    10     1
1      k    20     2
2  hello    30     3
3     we    40     4
4      r    50     5
I am selecting the values of Col2 such that the value in Col1 is 'hello':
my_values = df.loc[df['Col1']=='hello']['Col2']
This returns a Series with the values of Col2 as well as their index:
0    10
2    30
Name: Col2, dtype: int64
Now suppose I want to assign these values to Col3.
I only want to replace those values (index 0 and 2), keeping the other values in Col3 unmodified.
I tried:
df['Col3'] = my_values
But this assigns NaN to the other values: assigning a Series to a column replaces the whole column and aligns on the index, so the rows missing from my_values (the ones where Col1 is not 'hello') become NaN.
    Col1  Col2  Col3
0  hello    10  10.0
1      k    20   NaN
2  hello    30  30.0
3     we    40   NaN
4      r    50   NaN
How can I update certain values in Col3 leaving the others untouched?
    Col1  Col2  Col3
0  hello    10    10
1      k    20     2
2  hello    30    30
3     we    40     4
4      r    50     5
So, in short: given my_values, I want to put them into Col3 at the matching indices.
Or just base it on np.where:
df['Col3'] = np.where(df['Col1'] == 'hello', df.Col2, df.Col3)
If you base it on your my_values, assign through .loc (note the capital C in 'Col3', otherwise a new lowercase column is created):
df.loc[my_values.index, 'Col3'] = my_values
Or you can just use update(), which aligns on the index:
df['Col3'].update(my_values)
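Putting it together as a runnable check on the toy frame, using the update() variant:

import pandas as pd

df = pd.DataFrame({'Col1': ['hello', 'k', 'hello', 'we', 'r'],
                   'Col2': [10, 20, 30, 40, 50],
                   'Col3': [1, 2, 3, 4, 5]})

my_values = df.loc[df['Col1'] == 'hello', 'Col2']

# update() aligns on the index, so only rows 0 and 2 change
df['Col3'].update(my_values)
print(df)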

Select some columns based on WHERE in a dataframe

So, I am working with Blaze and wanted to perform this query on a dataframe:
SELECT col1,col2 FROM table WHERE col1 > 0
For SELECT *, this works: d[d.col1 > 0]. But I want col1 and col2 only rather than all columns. How should I go about it?
Thanks in advance!
Edit: Here I create d as: d = Data('postgresql://uri')
This also works: d[d.col1 > 0][['col1','col2']]
I think you can first subset the columns and then use boolean indexing:
print(d)
   col1  col2  col3
0    -1     4     7
1     2     5     8
2     3     6     9

d = d[['col1','col2']]
print(d)
   col1  col2
0    -1     4
1     2     5
2     3     6

print(d[d.col1 > 0])
   col1  col2
1     2     5
2     3     6
This is the same as:
print(d[['col1','col2']][d.col1 > 0])
   col1  col2
1     2     5
2     3     6
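For reference, in plain pandas (outside Blaze) the same filter-plus-projection can also be written as a single .loc call:

import pandas as pd

d = pd.DataFrame({'col1': [-1, 2, 3], 'col2': [4, 5, 6], 'col3': [7, 8, 9]})

# filter the rows and select the columns in one indexing step
print(d.loc[d.col1 > 0, ['col1', 'col2']])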