I want to remove 'hi', 'by', and 'dy' from col2 in one shot in SQL. I'm very new to SQL Server; if anyone could give an outline of how such problems are solved, that would be really helpful.
Col1 col2 col3
A hi!abcd 123
B bypython 678
C norm 888
D dupty dy 999
output:
Col1 col2 col3
A abcd 123
B python 678
C norm 888
D dupty 999
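For the record, a minimal sketch of the usual approach: nest one REPLACE per substring so all three are removed in a single expression. The table name t is hypothetical; note that the sample output also drops the '!' after 'hi' and the space left behind in 'dupty dy', hence the 'hi!' pattern and the RTRIM.
SELECT Col1,
       RTRIM(REPLACE(REPLACE(REPLACE(col2, 'hi!', ''), 'by', ''), 'dy', '')) AS col2,
       col3
FROM t;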
I have a df with several columns, and I want to remove the rows whose col2 timestamp falls on today's date.
col1 col2 col3 col4
ABC 2022-08-12 00:03:29.872 123 A1B2
BCD 2022-08-12 00:02:08.067 234 B1C2
CDE 2022-08-11 23:57:24.208 345 C1D2
DEF 2022-08-11 23:56:55.257 456 D1E2
Expected result (assuming today's date is 12 August 2022):
col1 col2 col3 col4
CDE 2022-08-11 23:57:24.208 345 C1D2
DEF 2022-08-11 23:56:55.257 456 D1E2
I tried the following:
df[pd.to_datetime(df.col2, errors='coerce') < pd.to_datetime('today')]
but it is not working; I still get rows from today. Can someone please help me with this?
Use Series.dt.date with != Timestamp.date. Your comparison fails because pd.to_datetime('today') carries the current time of day, so rows from earlier today still compare as less than it:
df = df[pd.to_datetime(df.col2, errors='coerce').dt.date != pd.to_datetime('today').date()]
print (df)
col1 col2 col3 col4
2 CDE 2022-08-11 23:57:24.208 345 C1D2
3 DEF 2022-08-11 23:56:55.257 456 D1E2
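If you prefer to stay with full timestamps, an equivalent sketch truncates both sides to midnight with Series.dt.normalize:
# normalize() truncates each timestamp to midnight, so the mask keeps
# only rows whose date differs from today's
df = df[pd.to_datetime(df.col2, errors='coerce').dt.normalize()
        != pd.Timestamp('today').normalize()]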
If I have two rows for the same ID, then I have to check col2 and pick the rows with values N and Q, skipping the row with U. If there is a single record with col2 = U, leave it as it is. So for IDs 123 and 555, the output has col2 N and Q respectively.
ID Col1 Col2 Col3
123 AAA N true
123 BBB U true
000 AAA N true
222 CCC U false
555 FIC Q false
555 VAN U true
Expected output:
ID Col1 Col2 Col3
123 AAA N true
000 AAA N true
222 CCC U false
555 FIC Q false
How can I do this in pandas?
In SQL, I tried with HAVING COUNT(*) > 1 and then picked these columns.
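For reference, a hedged sketch of the SQL-side idea, assuming a table named t: ROW_NUMBER can order the U row last within each ID so that only the preferred row survives.
SELECT ID, Col1, Col2, Col3
FROM (SELECT *,
             ROW_NUMBER() OVER (PARTITION BY ID
                                ORDER BY CASE WHEN Col2 = 'U' THEN 1 ELSE 0 END) AS rn
      FROM t) AS d
WHERE rn = 1;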
You can use this code:
df.drop_duplicates('ID')
The code above always keeps the first record per ID. You can keep the last record instead:
df.drop_duplicates(subset='ID', keep="first")
df.drop_duplicates(subset='ID', keep="last")
Or you can sort by any column first and then use drop_duplicates: by sorting ascending or descending and using keep="first", you keep the minimum or maximum per ID.
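For this sample, a concrete sketch of that idea: N, Q, and U happen to sort alphabetically with U last, so a plain sort followed by drop_duplicates keeps the preferred row per ID. This relies on the alphabetical accident and is not a general solution.
# 'U' sorts last here only because N < Q < U alphabetically
df.sort_values('Col2').drop_duplicates(subset='ID', keep='first')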
One simple approach is to sort your dataframe by Col2 ensuring that 'U' will end up last. There are several possibilities:
pandas.Categorical
This sets an ordered categorical type on Col2:
import numpy as np
import pandas as pd
# every value except 'U' first (sorted), then 'U' last as the largest category
categories = np.append(np.setdiff1d(df['Col2'], ['U']), ['U'])
df['Col2'] = pd.Categorical(df['Col2'], categories=categories, ordered=True)
df.sort_values(by='Col2').groupby('ID').first()
Split dataframe
This splits the dataframe in two based on the values of Col2 (not-U and U), and concatenates the two parts so that the U rows end up at the end:
pd.concat([df.query('Col2 != "U"'), df.query('Col2 == "U"')]).groupby('ID').first()
Custom sort order
This manually defines the sorting order from a list
custom_order = ['N', 'Q', 'Z', 'U']
custom_order_dict = dict(zip(custom_order, range(len(custom_order))))
df.sort_values(by='Col2', key=lambda x: x.map(custom_order_dict)).groupby('ID').first()
input
ID Col1 Col2 Col3
0 123 AAA N True
1 123 BBB U True
2 0 AAA N True
3 222 CCC U False
4 555 FIC Q False
5 555 VAN U True
6 777 UUU U False
7 777 ZZZ Z True
8 999 UUU U False
9 999 NNN N True
output
Col1 Col2 Col3
ID
000 AAA N True
123 AAA N True
222 CCC U False
555 FIC Q False
777 ZZZ Z True
999 NNN N True
I tried a solution with multiple steps. This might not be the best way to do it, but I did not find any other solution.
First step:
Separate the rows whose ID occurs multiple times:
df_multiple_record=pd.concat(g for _, g in df.groupby("ID") if len(g) > 1)
Output:
ID Col1 Col2 Col3
123 AAA N true
123 BBB U true
555 FIC Q false
555 VAN U true
Second step:
Drop the records with Col2 = 'U':
df_drop_U=df_multiple_record[df_multiple_record['Col2']!='U']
output:
ID Col1 Col2 Col3
123 AAA N true
555 FIC Q false
Third step:
Drop the duplicates on ID from the main extract to get the records with a single occurrence of ID:
df_single_record=df.drop_duplicates(subset=['ID'],keep=False)
output:
ID Col1 Col2 Col3
000 AAA N true
222 CCC U false
Fourth step:
Concatenate the single-record df with the df where we dropped 'U':
df_final=pd.concat([df_single_record,df_drop_U],ignore_index=True)
output:
ID Col1 Col2 Col3
000 AAA N true
222 CCC U false
123 AAA N true
555 FIC Q false
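For comparison, the four steps collapse into a single boolean mask, assuming each duplicated ID has at least one non-U row. A sketch:
# keep a row if its Col2 is not 'U', or if its ID occurs only once
df_final = df[(df['Col2'] != 'U') | (~df['ID'].duplicated(keep=False))]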
I would like awk or sed or any other display-filter mechanism in the native shell to remove the space from string lines that match, and not remove any space between the two strings (columns) when they don't match, and then display the output.
My file:
# cat test2
col0 col1 col2 col3 col4 col5
ln1 abc def ghi jkl mno
ln2 abc defghi jkl mno pqr
My output I would like to be:
# <something to view/grep/filter with> test2
col0 col1 col2 col3 col4 col5
ln1 abc defghi jkl mno
ln2 abc defghi jkl mno pqr
I have tried multiple combinations of grep, awk, and cut, but was not able to do it. I am not good with sed, but I can try. I even tried to use an interim file, i.e. echo the output to some file and then grep, but I failed to do that too.
Edited with more of my requirement:
My biggest problem is that I can't predict where the space will be or what the contents of the row entries will be. So I would like sed to produce the output based not on a specific string but on column numbers.
My file:
# cat test2
col0 col1 col2 col3 col4 col5
ln1 abc def ghi jkl mno
ln2 abc defghi jkl mno pqr
ln3 aaa bbb ccc ddd eee
ln4 aaa bbbccc ddd eee
Output File:
# <something to view/grep/filter with> test2
col0 col1 col2 col3 col4 col5
ln1 abc def ghi jkl mno
ln2 abc defghi jkl mno pqr
ln3 aaa bbbccc ddd eee
ln4 aaa bbbccc ddd eee
sed 's/def ghi/defghi/' file
If that's not what you wanted then edit your question to clarify your requirements and provide input/output that better demonstrates your problem.
I have a Hive table with 3 columns; col3 is an array. The output from the table is like this:
select col1, col2, col3 from testing;
col1 col2 col3
xyx 123 ["xyz","Good investing","123","abc","Bad investing","006","port123","future investing","008","flaf4","good research investing","01"]
xyx 789 ["xyz","Good investing","789","flag1","Bad investing","006","port123","future investing","008"]
I want to parse col3 so that the output looks like the following:
xyx 123 "xyz","Good investing","123"
xyx 123 "abc","Bad investing","006","port123"
xyx 123 "future investing","008","flaf4"
xyx 123 "good research investing","01"
xyx 789 "xyz","Good investing","789",
xyx 789 "flag1","Bad investing","006",
xyx 789 "port123","future investing","008"
Any help will be highly appreciated.
Input:
File 1
col1 col2 col3 col4 col5 col6 col7
A 91 - E Abu 7 -
B 82 - f Anu 9 -
C 93 - G Aru 8 -
File 2
col1 col2 col3 col4 col5 col6 col7
A 91 - x Bob 7 -
B 82 - y Bag 9 -
C 93 - z Bui 8 -
File 3
col1 col2 col3 col4 col5 col6 col7
A 91 - T Can 7 -
B 82 - U Con 9 -
C 93 - V Cuu 8 -
Output Expected:
col1 col2 col3 col4 col5
A 91 Abu Bob Can
B 82 Anu Bag Con
C 93 Aru Bui Cuu
I have three files having the same data in col1 and col2. I need to print the fifth column of all files along with the first two columns.
I am able to do it using two files. Can anyone help me do it with three or more files?
Here is one way using awk:
$ awk '
BEGIN {
SUBSEP = FS;
print "col1 col2 col3 col4 col5"
}
FNR>1 {
a[$1,$2] = (a[$1,$2]?a[$1,$2]FS$5:$5)
}
END {
for(x in a) print x, a[x]
}' file1 file2 file3
col1 col2 col3 col4 col5
C 93 Aru Bui Cuu
A 91 Abu Bob Can
B 82 Anu Bag Con
You can pipe the output to sort if you need sorted output. This is not limited to three files; it scales to any number of files. Just add the file names at the end, or use a glob to match all files under a given directory.
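For example, a hypothetical invocation that keeps the header line first while sorting the data rows (the awk program above is abbreviated as '...'):
awk '...' file1 file2 file3 | { IFS= read -r header; printf '%s\n' "$header"; sort; }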
Assuming all three files have the same number of rows, because of this sentence:
I have three files having same data at col1 and 2.
awk 'BEGIN { OFS = "\t"
    # consume the header line of each file, then print a new header
    getline < "file1"; getline < "file2"; getline < "file3"
    print "col1", "col2", "col3", "col4", "col5"
    while (1) {                        # read the three files in lockstep
        getline < "file1"; a = $1; b = $2; c = $5
        getline < "file2"; d = $5
        f = getline < "file3"; e = $5  # f == 0 once file3 is exhausted
        if (!f) exit
        print a, b, c, d, e } }'
Output:
col1 col2 col3 col4 col5
A 91 Abu Bob Can
B 82 Anu Bag Con
C 93 Aru Bui Cuu
This discards the first line of each file, then reads the files line by line, printing the desired fields.
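Under the same equal-row-count assumption, paste is a simpler alternative to chained getline. A sketch, assuming the files are literally named file1, file2, and file3:
# paste joins corresponding lines side by side; each file contributes
# 7 fields, so col5 of the three files lands in fields 5, 12 and 19
paste file1 file2 file3 |
awk 'NR == 1 { print "col1 col2 col3 col4 col5"; next }
     { print $1, $2, $5, $12, $19 }'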