How to count only the partition date in HDFS? - awk

I have an HDFS structure like below:
hadoop fs -ls /user/user/warehouse/x.db/table/dt=2022-06-05
hadoop fs -ls /user/user/warehouse/x.db/table/dt=2022-07-05
hadoop fs -ls /user/user/warehouse/x.db/table/dt=2022-08-05
hadoop fs -ls /user/user/warehouse/x.db/table/dt=2022-09-05
hadoop fs -ls /user/user/warehouse/x.db/table/dt=2022-10-05
I can get the count for a table using:
hadoop fs -count /user/user/warehouse/x.db/table/
13923 2183394 119438162997420 user/user/warehouse/x.db/table
I want to get the count for a specified dt partition range, for example only partitions from 2022-06 to 2022-08.
Please guide me.
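One possible approach (a sketch, assuming hadoop fs -count accepts the usual shell-style glob on the dt= directories): let the glob select the partitions in the range, then sum the per-partition counts with awk.
# Matches dt=2022-06-*, dt=2022-07-* and dt=2022-08-*; -count prints one line per
# matched partition: DIR_COUNT FILE_COUNT CONTENT_SIZE PATHNAME.
hadoop fs -count '/user/user/warehouse/x.db/table/dt=2022-0[6-8]-*' \
  | awk '{dirs+=$1; files+=$2; bytes+=$3} END {print dirs, files, bytes}'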

Related

Automation shell script to copy big query dataset using transfer service command

I want to write a shell script which copies source datasets into a target dataset through the transfer command line.
Note: the copy should happen at dataset level, as we have thousands of datasets in BigQuery.
Consider the following approach:
#!/bin/bash
Project_Id='input your project name'
Location='input your location'
Tar_Data_Set='input your targeted dataset name'
# List all datasets in the project; sed skips the two header lines that bq ls prints.
Data_Sets=$(bq ls -n 1000 --project_id=${Project_Id} --location=${Location} | sed -n '3,$p')
[ $? -ne 0 ] && echo "Input parameter error" && exit 1
for Data_Set in ${Data_Sets}
do
    # Skip the bq ls header lines and take the first column (the table name).
    for Table_Name in $(bq ls ${Data_Set} | awk '{if(NR>2){print $1}}')
    do
        # Print the copy command for review; remove the echo (or pipe the script's
        # output to bash) to actually run the copies.
        echo "bq cp -f ${Project_Id}.${Data_Set}.${Table_Name} ${Project_Id}.${Tar_Data_Set}.${Table_Name};"
    done
done
Here we assume that the tables are copied within the same project ID.
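Since the question mentions the transfer service specifically, note that dataset-level copies can also be created as transfer configs from the CLI. The following is only a sketch; the data source name and the parameter keys are assumptions and should be checked against the bq mk --transfer_config documentation:
# Sketch only: flag values and params keys are assumed, verify before use.
bq mk --transfer_config \
  --data_source=cross_region_copy \
  --target_dataset=target_dataset_name \
  --display_name='copy source_dataset_name' \
  --params='{"source_dataset_id":"source_dataset_name","source_project_id":"source_project_id"}'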

Can the bq CLI list only views and exclude tables?

Listing the contents of a dataset is as simple as:
bq ls project_id:dataset_id
This includes both views and tables. Is there a way to filter this to only show views? The --filter parameter only appears to work on datasets and transfer jobs.
References:
https://cloud.google.com/bigquery/docs/reference/bq-cli-reference#bq_ls
https://cloud.google.com/bigquery/docs/listing-views
You have two options here:
Querying INFORMATION_SCHEMA.VIEWS (Google bills a minimum of 10 MB per query):
SELECT TABLE_NAME FROM `PROJECT_NAME`.dataset_name.INFORMATION_SCHEMA.VIEWS ;
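If you prefer to stay in the terminal, the same query can be run through the bq CLI itself (a sketch; PROJECT_NAME and dataset_name are the same placeholders as above):
bq query --nouse_legacy_sql \
  'SELECT table_name FROM `PROJECT_NAME`.dataset_name.INFORMATION_SCHEMA.VIEWS'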
Using the bq utility in combination with grep or awk:
bq ls __dataset__ | grep -i VIEW
or with awk looking at the second column:
bq ls __dataset__ | awk '{ if($2 == "VIEW"){ print $1; } }'

How does piping handle multiple files in linux?

Naive me wanted to parse 50 files using awk, so I did the following:
zcat dir_with_50files/* > huge_file
cat huge_file | awk '{parsing}'
Of course, this was terrible because it would spend time creating a file, then consume a whole bunch of memory to pass along to awk.
Then a coworker showed me that I could do this.
zcat dir_with_50files/filename{0..50} | awk '{parsing}'
I was amazed that I would get the same results without the memory consumption.
ps aux also showed that the two commands ran in parallel. I was confused about what was happening and this SO answer partially answered my question.
https://stackoverflow.com/a/1072251/6719378
But if piping knows to start the second command after a certain amount of buffered data, why does my naive approach consume so much more memory compared to the second approach?
Is it because I am using cat on a single file compared to loading multiple files?
You can reduce the maximum memory usage by running zcat file by file,
ex:
for f in dir_with_50files/*
do
  zcat "$f" | awk '{parsing}' >> Result.File
done
# or
find dir_with_50files/ -exec zcat {} \; | awk '{parsing}' >> Result.File
But it depends on your parsing:
OK for modifying, deleting, or copying lines when there is no relation to previous records (ex: sub(/foo/, "bar"))
Bad for counters (ex: List[$2]++) or logic that relates records across files (ex: NR != FNR {...}; ! List[$2]++ {...})
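To illustrate the counter case, a minimal sketch (the directory and the field being counted are just placeholders): a single awk over one decompressed stream keeps one counter across all files, while the file-by-file loop above restarts the counter for every file.
# One awk invocation: the count[] array is shared across every file in the stream.
zcat dir_with_50files/* | awk '{count[$2]++} END {for (k in count) print k, count[k]}'
# File-by-file: each awk starts with an empty count[] array, so the per-file
# results would still have to be merged afterwards.
for f in dir_with_50files/*
do
  zcat "$f" | awk '{count[$2]++} END {for (k in count) print k, count[k]}'
done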

SSH: recursively change all subfolder names to a specific name

I have hundreds of folders, each with a subfolder named "thumbs". I need to rename the "thumbs" subfolder to "thumb" under each folder.
I tried
find . -type d -exec rename 's/^thumbs$/thumb/' {} ";"
I ran this in the shell from inside the folder that contains all the subfolders, each of which contains the "thumbs" folder that needs to be renamed to "thumb".
The command sat there thinking for a long time, so I hit CTRL+C to stop it. I checked and no folder was renamed under the current directory; I don't know whether I renamed folders outside the directory I was in. Can someone tell me where I went wrong with the command?
Goal 1: To change a subfolder "thumbs" to "thumb" if only one level deep.
Example Input:
./foo1/thumbs
./foo2/thumbs
./foo2/thumbs
Solution:
find . -maxdepth 2 -type d -name thumbs | sed 'p;s/thumbs/thumb/' | xargs -n2 mv
Output:
./foo1/thumb
./foo2/thumb
./foo2/thumb
Explanation:
Use find to give you all "thumbs" folders only one level deep. Pipe the output to sed. The p option prints the input line and the rest of the sed command changes "thumbs" to "thumb". Finally, pipe to xargs. The -n2 option tells xargs to use two arguments from the pipe and pass them to the mv command.
Issue:
This will not catch deeper subfolders. You can't simply drop the depth limit here, because find prints its output from the top down, and since we rewrite paths with sed before we mv, mv will fail for deeper subfolders. For example, ./foo/thumbs/thumbs/ will not work because mv will take care of ./foo/thumbs first and make it ./foo/thumb, but then the next output line will result in an error because ./foo/thumbs/thumbs/ no longer exists.
Goal 2: To change all subfolders "thumbs" to "thumb" regardless of how deep.
Example Input:
./foo1/thumbs
./foo2/thumbs
./foo2/thumbs/thumbs
./foo2/thumbs
Solution:
find . -type d -name thumbs | awk -F'/' '{print NF, $0}' | sort -k 1 -n -r | awk '{print $2}' | sed 'p;s/\(.*\)thumbs/\1thumb/' | xargs -n2 mv
Output:
./foo1/thumb
./foo2/thumb
./foo2/thumb/thumb
./foo2/thumb
Explanation:
Use find to give you all "thumbs" subfolders. Pipe the output to awk to print the number of '/'s in each path plus the original output. sort the output numerically, in reverse (to put the deepest paths on top) by the number of '/'s. Pipe the sorted list to awk to remove the counts from each line. Pipe the output to sed. The p option prints the input line and the rest of the sed command finds the last occurrence of "thumbs" and changes only it to "thumb". Since we are working with sorted list in the order of deepest to shallowest level, this will provide mv with the right commands. Finally, pipe to xargs. The -n2 option tells xargs to use two arguments from the pipe and pass them to the mv command.
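A shorter alternative worth mentioning (a sketch, assuming directory names contain no newlines): find's -depth option already visits children before their parents, which gives the same deepest-first ordering without the awk/sort steps.
# -depth lists ./foo2/thumbs/thumbs before ./foo2/thumbs, so every mv sees a
# path that still exists; only the trailing "thumbs" component is rewritten.
find . -depth -type d -name thumbs | while read -r d
do
  mv "$d" "${d%thumbs}thumb"
done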

Google BigQuery - how to drop table with bq command?

Google BigQuery's bq command enables you to create, load, query and alter tables.
I did not find any documentation about dropping a table; I'd be happy to know how to do it.
I find the bq tool much easier to use than writing a Python interface for each command.
Thanks.
Found it:
bq rm -f -t data_set.table_name
-t for table, -f for force, -r removes all tables in the named dataset
Great tool.
Is there a way to bulk delete multiple tables? – activelearner
In bash, you can do something like:
for i in $(bq ls -n 9999 my_dataset | grep keyword | awk '{print $1}'); do bq rm -ft my_dataset.$i; done;
Explanation:
bq ls -n 9999 my_dataset - list up to 9999 tables in my dataset
| grep keyword - pipe the results of the previous command into grep, search for a keyword that your tables have in common
| awk '{print $1}' - pipe the results of the previous command into awk and print only the first column
Wrap all that into a for loop
do bq rm -ft my_dataset.$i; done; - remove each table from your dataset
I would highly recommend running the commands to list out the tables you want to delete before you add the 'do bq rm'. This way you can ensure you are only deleting the tables you actually want to delete.
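For example, a dry run with the same dataset and keyword prints only the table names that the loop would delete:
bq ls -n 9999 my_dataset | grep keyword | awk '{print $1}'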
UPDATE:
The argument -ft now returns an error and should be simply -f to force the deletion, without a prompt:
for i in $(bq ls -n 9999 my_dataset | grep keyword | awk '{print $1}'); do bq rm -f my_dataset.$i; done;
You can use Python code (e.g. in a Jupyter Notebook) for the same purpose:
from google.cloud import bigquery

bigquery_client = bigquery.Client() # Create a BigQuery client object
dataset_id = 'Name of your dataset'
table_id = 'Table to be deleted'
table_ref = bigquery_client.dataset(dataset_id).table(table_id)
bigquery_client.delete_table(table_ref) # API request
print('Table {}:{} deleted.'.format(dataset_id, table_id))
If you want to delete a complete dataset, including the tables it contains, in one go, the command is:
!bq rm -f -r serene-boulder-203404:Temp1 # removes the complete dataset along with the tables in it
If your dataset is already empty, you can use the following command instead; make sure you have deleted all the tables in that dataset first, otherwise it will generate an error (dataset is still in use).
#Now remove an empty dataset using bq command from Python
!bq rm -f dataset_id
print("dataset deleted successfully !!!")
I used a command-line for loop (Windows cmd syntax) to delete a month of table data, but this relies on your table naming:
for %d in (01 02 03 04 05 06 07 08 09 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31) DO bq rm -f -t dataset.tablename_201701%d
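For comparison, a hedged bash equivalent of the same loop (the table name pattern is taken from the line above):
for d in $(seq -w 1 31)
do
  bq rm -f -t dataset.tablename_201701${d}
done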
Expanding on the excellent answer from @james: I simply needed to remove all tables in a dataset, but not actually remove the dataset itself. Hence the grep part was unnecessary for me, however I still needed to get rid of the
table_id
------------------
header that bq returns when listing tables; for that I used sed to remove those first two lines:
for i in $(bq ls -n 9999 my_dataset | sed "1,2 d" | awk '{print $1}'); do bq rm -f my_dataset.$i; done;
Perhaps there's a bq option to not return that header, but if there is, I don't know it.
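One possible workaround, assuming bq's global --format flag behaves as I remember: JSON output has no pretty-printed header at all, at the cost of needing a JSON parser such as jq. The field name below follows the BigQuery tables.list API, so verify it before relying on it.
for i in $(bq ls --format=json -n 9999 my_dataset | jq -r '.[].tableReference.tableId')
do
  bq rm -f my_dataset.$i
done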