ShellScript



Flow Control

for i in {1..12};do
   scp -i ~/tmp/tmp_pem hadoop@ec2host.com://path/to/file_${i}.Rda .
done
# or in one line
for i in {1..12}; do scp -i /tmp/tmp_pem hadoop@ec2host.com://path/to/file1_${i}.Rda . ;scp -i /tmp/tmp_pem hadoop@ec2host.com://path/to/file1_${i}_train.Rda . ; scp -i /tmp/tmp_pem hadoop@ec2host.com://path/to/file2_${i}.Rda . ; done

Piping

  • Piping with stdin redirection. Without the stdin redirection, emacs complains the stdin is not tty. The “$@” is a Special Shell Variables which is a string containing all arguments to the shell

    ls prod_JP*|xargs zsh -c 'emacs "$@"</dev/tty'
    

Utilities

find (Cf. 1):

  • -exec

    #The following commands do the same things: 
    find / -name core -exec /bin/rm -f '{}' \;
    find / -name core -exec /bin/rm -f '{}' +
    find / -name core | xargs /bin/rm -f
    
    • The ‑exec is followed by some shell command line, ended with a semicolon (“;”). (The semicolon must be quoted from the shell, so find can see it!) Within that command line, the word “{}” will expand out to the name of the found file. An alternate form of ‑exec ends with a plus-sign, not a semi-colon. This form collects the filenames into groups or sets, and runs the command once per set. (This is exactly what xargs does, to prevent argument lists from becoming too long for the system to handle.)1
    • To handle exotic filenames(which contain spaces or newlines), use the form “find .. -print0|xargs -0 …”
  • Gotcha: When no action is given, “-print” is assumed and grouping is perform. This could messed up the boolean operation precedence and give surprising results. See 1.

tar

  • Archive and gzip a whole directory:
tar czvf ShiyuanGu.tar.gz ShiyuanGu
  • Extract and uncompress:
tar xzvf ShiyuanGu.tar.gz ShiyuanGu

rsync

  • Rsync between remote and local through ssh
rsync -avzhe ssh --progress shiyuang@shiyuang.desktop.amazon.com://path/to/dir .
rsync -avzhe "ssh -i /full/path/to/pem" --progress localdir user@ec2host.amazonaws.com://mnt/

-a: archive
-v: verbose
-z: compress file data during transfer
-h: human readable info
-e ssh: connect through ssh(Use full path in -i)

sed

Features

  • Make only one pass over the inputs and hence more efficient.
  • Ability to filter text in a pipeline

Synopsis:

sed [OPTION]... {script} [input-file]
  • The script is actually the first non-option parameter, which sed specially considers a script and not an input file if (and only if) none of the other options specifies a script to be executed, that is if neither of the -e and -f options is specified. `man sed` calls it {script-only-if-no-other-script}.

  • When multiple -e, -f are givens, the script is understood to mean the in-order catenation Cf. here

  • sed maintains two spaces, the active pattern space, and the hold space. Both are initially empty. sed processes the input line by line. The input can be a file, a list of files or stdin, denoted by the character -. sed starts each cycle by reading in the line, removing any trailing newline and placing it in the pattern space. Then the script commands are executed. When all commands in the script are finished, the contents of the pattern space are printed out to the output stream, adding back the trailing newline if it was removed(well, some little nasty thing happens here Cf. sed manual footnote). The pattern space starts from fresh before each cycle starts(except for special commands, like D). The hold space, on the other hand, keeps its data between cycles (Cf. How sed works: Execution Cycle)

  • When the input is a list of files, by default, sed considers the files a single continuous long stream. GNU sed extension has a option `-s` allows user to consider them as separate files, and line numbers are relative to the start of each file (the catch for `-s` is that range address cannot span several files). For example, suppose sed-test-1.in and sed-test-2.in has two lines of “abcde”. The first command below only replaces the first line in first input file sed-test-1.in while the second command replaces the first line in each input file as well as the input from stdin

    echo "abcd"|sed -e "1 y/a/b/" sed-test-1.in sed-test-2.in -  
    echo "abcd"|sed -s -e "1 y/a/b/" sed-test-1.in sed-test-2.in -
    

Examples

  • Centering Lines with sed.

    In GNU sed manual, this script(https://www.gnu.org/software/sed/manual/sed.html#Centering-lines) is used to demonstrate how to center each line with sed. But this script is wrong and dangerous!. If a line is longer than 80 characters, the line will be silently truncated.

        #!/usr/bin/sed -f
        # Generate 80 spaces in the hold space
    # Note that hold space keeps its data between cycles,but we only need to generate the spaces once. 
    # The 1 is the address matches only the first line. 
        1 {
    x   #swap pattern space with hold space; since s operates on pattern space only. 
    s/^$/          /   #Replace empty line with 10 white spaces;  
    s/^.*$/&&&&&&&&/   # ^.*$ matches the previous inserted 10 white spaces; The & means the whole match. So we have 80 white spaces;  
    x   # swap pattern space with hold space. Now we have 80 spaces in the hold space. In the pattern space, we have the current line. 
        }
    
    # del leading and trailing spaces
    # The y command transliterate characters. 
    # Note that the y command, different from s,  allows no comments on the same line  
        y/\t/ /
        s/^ *// # delete all leading spaces  
        s/ *$// # delete all trailing spaces
    
    #The G command append a newline to the contents of the pattern space, 
    #and then append the contents of the hold space to that of the pattern space. 
        G      
    
        # keep first 81 chars (80 + a newline)
        # Note that . in sed matches newline. This is different from elisp or python
        # CAUTION: This command is dangerous, if the line is longer than 80 characters, we will lose anything after the 80th characters!
        s/^\(.\{81\}\).*$/\1/
    
        # \2 matches half of the spaces, which are moved to the beginning
        # The expression \(.*\)\2 captures necessary amount of spaces to fill in front, and divide the spaces equally into two parts. 
    # Grouping and subexpression matching allows us to do arithmetics. 
        s/^\(.*\)\n\(.*\)\2/\2\1/
    
  • Replace leading spaces with tabs.

    Sometimes we need to change indentation style, for example, replace every leading four spaces with one tab. The following one-liner will do the work. Note the use of label, conditional jump command t and grouping. The command t jumps to the label only when the substitution succeeds. The command t and the label forms an implicit loop where each iteration replaces a four spaces with a tab. For the meanings of `{}` and `+`, refer to the `find` utility here .

    find . -iname "*.ion" -exec sed -i ':l s/^\(\t*\) \{4\}/\1\t/; t l' {} +
    

Gotcha:

  • The s command and the g flags.

    The command ‘s/Hello/HelloWorld/’ only replace the first match of each line. To replace all matches, use the g flag, ‘s/Hello/HelloWorld/g’.

  • Word boundary

    Sometimes we think in terms of word and want the match for the whole word. In this case, use the world boundary “\b”

cut

  • Delete the third field with a common as field separator

    cut --complement -d, -f3 adult.data > income_train_data.csv
    

column

  • The following one-liner is taken from this post. It allows us to display s csv(comma-separated-value) file in a nice format using command lines

    cat file.csv | sed -e 's/,,/, ,/g' | column -s, -t | less -#5 -N -S
    

    The sed command is used to handle the corner case of empty field; the ‘g’ means replacing all matches not just the first; the `-s` in the `column` command is to specify delimiters; `-t` creates a table for pretty print; `-#` in the `less command `specifies the number of positions to scroll using right/leftarrow key; `-N` displays the line number and `-S` causes the line to be chopped rather than folded ( that is , `M-x toggle-truncate-lines` in Emacs). In case of tab separated file with missing field, we can replace tab with comma first, and then fill in the missing field with space,

    cat file.txt|sed -e 's/\t/,/g;s/,,/, ,/g'|column -s, -t|less -#5 -S
    

    The sed command ‘s/\t/,/g’ replace tab with comma and ‘s/,,/, ,/g’ fills in the missing field with a space. Note that without the fill-in step, the alignment is wrong.

awk

Randomly Sample 1/1000 row.

awk 'BEGIN {srand()} !/^$/ { if (rand() <= .001) print $0}' infile.txt> outfile.txt 
awk 'BEGIN {srand()} !/^$/ { if (rand() <= .01 || FNR==1) print $0}' infile.txt> outfile.txt #include header.

Exact a field and numeric comparison

## The following awk command extracts the lines with pval less than 0.05 from a csv file
## Sample line: 20150204,21,R,cp, All, 1, NTrig, pval,.04
## -F specifies field separator $0 is the whole line, $8,$9 is the eighth and ninth field. 
awk -F, '$8==" pval" && $9<0.05 {print $0}' inputfile.csv

Concatenate files with same header

## concatenate csv files with the header stripped-off except the first file. 
## FNR==1 && NR!=1: match the first line of a line except it's also the first line across all files(NR==1)
## getline: skip the line. 
## 1{print}: print everything except the lines previous skipped  
> awk '
FNR==1 && NR!=1 {getline;} 
1{print}
' input-blob*.csv > input-all.csv

## A more sophisticated example: the header spreads multiple line. 
> awk '
FNR==1 && NR!=1 {while (/^<header>/) getline;}  #skip the lines started with the <header> tags
1{print}
' input-blob*.csv > input-all.csv
Advertisements

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s