Tensorflow RNN PTB Example Walkthrough



Structure

There is only one graph, but there are three PTBModel instances, one for each of the train, valid, and test stages. Note the use of variable_scope and of reuse to share the weights/biases across the three instances.

with tf.Graph().as_default():
  initializer = tf.random_uniform_initializer(-config.init_scale,
											  config.init_scale)

  with tf.name_scope("Train"):
	train_input = PTBInput(config=config, data=train_data, name="TrainInput")
	with tf.variable_scope("Model", reuse=None, initializer=initializer):
	  m = PTBModel(is_training=True, config=config, input_=train_input)
	tf.scalar_summary("Training Loss", m.cost)
	tf.scalar_summary("Learning Rate", m.lr)

  with tf.name_scope("Valid"):
	valid_input = PTBInput(config=config, data=valid_data, name="ValidInput")
	with tf.variable_scope("Model", reuse=True, initializer=initializer):
	  mvalid = PTBModel(is_training=False, config=config, input_=valid_input)
	tf.scalar_summary("Validation Loss", mvalid.cost)

  with tf.name_scope("Test"):
	test_input = PTBInput(config=eval_config, data=test_data, name="TestInput")
	with tf.variable_scope("Model", reuse=True, initializer=initializer):
	  mtest = PTBModel(is_training=False, config=eval_config,
					  input_=test_input)

A typical epoch loop follows. Each epoch iteration completes a full pass over the train, valid, and test datasets. Note that the learning rate is kept constant within each epoch and only updated between epochs.

sv = tf.train.Supervisor(logdir=FLAGS.save_path)
with sv.managed_session() as session:
  for i in range(config.max_max_epoch):
	lr_decay = config.lr_decay ** max(i + 1 - config.max_epoch, 0.0)
	m.assign_lr(session, config.learning_rate * lr_decay)

	print("Epoch: %d Learning rate: %.3f" % (i + 1, session.run(m.lr)))
	train_perplexity = run_epoch(session, m, eval_op=m.train_op,
								 verbose=True)
	print("Epoch: %d Train Perplexity: %.3f" % (i + 1, train_perplexity))
	valid_perplexity = run_epoch(session, mvalid)
	print("Epoch: %d Valid Perplexity: %.3f" % (i + 1, valid_perplexity))

  test_perplexity = run_epoch(session, mtest)
  print("Test Perplexity: %.3f" % test_perplexity)

  if FLAGS.save_path:
	print("Saving model to %s." % FLAGS.save_path)
	sv.saver.save(session, FLAGS.save_path, global_step=sv.global_step)
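The decay schedule above can be checked with plain arithmetic. This is a toy sketch; learning_rate, lr_decay, and max_epoch below are made-up values, not the real PTB config.

```python
# Toy check of the decay schedule: lr stays at learning_rate for the first
# max_epoch epochs, then is multiplied by lr_decay once per extra epoch.
learning_rate, lr_decay, max_epoch = 1.0, 0.5, 4
lrs = [learning_rate * lr_decay ** max(i + 1 - max_epoch, 0.0)
       for i in range(6)]
print(lrs)  # [1.0, 1.0, 1.0, 1.0, 0.5, 0.25]
```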

Building the Graph

An RNN is a sequence model. Each training example is a pair (input_data, target), where input_data and target are sequences of the same length, offset from each other by one position. For example, given a sequence (w_1, w_2, ...), if the input is (w_1, w_2, ..., w_n), the target is (w_2, w_3, ..., w_{n+1}). Also note how the RNN is unrolled. I have commented the parts worth attention.
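The shift-by-one relation can be sketched with made-up token ids:

```python
# Made-up token ids illustrating how input_data and targets are offset by one.
seq = [10, 11, 12, 13, 14]
num_steps = 4
input_data = seq[:num_steps]       # (w_1, ..., w_n)
targets = seq[1:num_steps + 1]     # (w_2, ..., w_{n+1})
print(input_data, targets)  # [10, 11, 12, 13] [11, 12, 13, 14]
```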

class PTBModel(object):
  """The PTB model."""

  def __init__(self, is_training, config, input_):
	self._input = input_

	batch_size = input_.batch_size  #20
	num_steps = input_.num_steps #20, the length of the sequence in each learning example. 
	size = config.hidden_size #200-1500 depends on the config
	vocab_size = config.vocab_size

	# Slightly better results can be obtained with forget gate biases
	# initialized to 1 but the hyperparameters of the model would need to be
	# different than reported in the paper.
	lstm_cell = tf.nn.rnn_cell.BasicLSTMCell(size, forget_bias=0.0, state_is_tuple=True) #sgu: This line doesn't create the weights/biases inside the LSTM cell yet.

	#sgu: Dropout is applied for medium/large configuration
	if is_training and config.keep_prob < 1:
	  lstm_cell = tf.nn.rnn_cell.DropoutWrapper(
		  lstm_cell, output_keep_prob=config.keep_prob)
	cell = tf.nn.rnn_cell.MultiRNNCell([lstm_cell] * config.num_layers, state_is_tuple=True)

	self._initial_state = cell.zero_state(batch_size, data_type())

	with tf.device("/cpu:0"):
	  #sgu: embedding vector is shared across train, valid and test 
	  embedding = tf.get_variable(
		  "embedding", [vocab_size, size], dtype=data_type())
	  inputs = tf.nn.embedding_lookup(embedding, input_.input_data)

	if is_training and config.keep_prob < 1:
	  inputs = tf.nn.dropout(inputs, config.keep_prob)

	#sgu: Unroll the RNN: the same cell (and hence the same weights) is applied
	#sgu: at every time step, so variables are reused after the first step.
	outputs = []
	state = self._initial_state
	with tf.variable_scope("RNN"):
	  for time_step in range(num_steps):
		if time_step > 0: tf.get_variable_scope().reuse_variables()
		(cell_output, state) = cell(inputs[:, time_step, :], state)
		outputs.append(cell_output)

	#sgu: cell_output: (batch_size, embedding_size) 
	#sgu: output shape: (num_steps*batch_size, embedding_size)
	output = tf.reshape(tf.concat(1, outputs), [-1, size])
	softmax_w = tf.get_variable(
		"softmax_w", [size, vocab_size], dtype=data_type())
	softmax_b = tf.get_variable("softmax_b", [vocab_size], dtype=data_type())
	#sgu: logit shape: (num_step*batch_size, vocab_size)
	logits = tf.matmul(output, softmax_w) + softmax_b
	loss = tf.nn.seq2seq.sequence_loss_by_example(
		[logits],
		[tf.reshape(input_.targets, [-1])],
		[tf.ones([batch_size * num_steps], dtype=data_type())])
	self._cost = cost = tf.reduce_sum(loss) / batch_size
	#sgu: keep final state which is used as initial_state for next iteration(see run_epoch())
	self._final_state = state  

	if not is_training:
	  return

	self._lr = tf.Variable(0.0, trainable=False)
	tvars = tf.trainable_variables() 
	grads, _ = tf.clip_by_global_norm(tf.gradients(cost, tvars),
									  config.max_grad_norm)

	#sgu: When the learning rate self._lr changes, the optimizer picks it up automatically
	optimizer = tf.train.GradientDescentOptimizer(self._lr)
	self._train_op = optimizer.apply_gradients(
		zip(grads, tvars),
		global_step=tf.contrib.framework.get_or_create_global_step())
	self._new_lr = tf.placeholder(
		tf.float32, shape=[], name="new_learning_rate")
	self._lr_update = tf.assign(self._lr, self._new_lr)  

  def assign_lr(self, session, lr_value):
	session.run(self._lr_update, feed_dict={self._new_lr: lr_value})

Feeding the States of LSTM

run_epoch runs multiple iterations to make one full pass over the dataset. In each iteration, the states of the LSTM need to be fed. Note that the final state of the current iteration is used as the initial state of the next iteration. Also, the learning rate is kept constant during one epoch.

def run_epoch(session, model, eval_op=None, verbose=False):
  """Runs the model on the given data."""
  start_time = time.time()
  costs = 0.0
  iters = 0
  state = session.run(model.initial_state)

  fetches = {
	  "cost": model.cost,
	  "final_state": model.final_state,
  }
  if eval_op is not None:
	fetches["eval_op"] = eval_op

  for step in range(model.input.epoch_size):
	feed_dict = {}
	#sgu: use the final state of the current mini-batch as the initial state of the subsequent minibatch
	#sgu: multiple LSTM cells can stack together. state[i] is the state of i-th cell.  
	for i, (c, h) in enumerate(model.initial_state):
	  feed_dict[c] = state[i].c
	  feed_dict[h] = state[i].h

	vals = session.run(fetches, feed_dict)
	cost = vals["cost"]
	state = vals["final_state"] 

	costs += cost
	iters += model.input.num_steps
	if verbose and step % (model.input.epoch_size // 10) == 10:
	  #sgu: the 1st : % of progress in current epoch;
	  #     the 2nd : perplexity
	  #     the 3rd:  words per sec  so far in the training 
	  print("%.3f perplexity: %.3f speed: %.0f wps" %
			(step * 1.0 / model.input.epoch_size, np.exp(costs / iters),
			 iters * model.input.batch_size / (time.time() - start_time)))

  return np.exp(costs / iters)
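run_epoch returns np.exp(costs / iters): perplexity is the exponential of the average per-step cross-entropy. A tiny sanity check with made-up numbers:

```python
import math

# costs is the accumulated cross-entropy, iters the number of steps;
# the values below are made up for illustration.
costs, iters = 6.0, 3
perplexity = math.exp(costs / iters)  # exp(2.0)
```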

Study Note: Tensorflow



Tensor vs. Variable, Constant, Node

#code excerpt from https://github.com/tensorflow/tensorflow/issues/6322
# Hidden 1
with tf.name_scope('hidden1'):
  weights = tf.Variable(
      tf.truncated_normal([IMAGE_PIXELS, hidden1_units],
                        stddev=1.0 / math.sqrt(float(IMAGE_PIXELS))),
      name='weights')
  biases = tf.Variable(tf.zeros([hidden1_units]), name='biases')
  hidden1 = tf.nn.relu(tf.matmul(images, weights) + biases)
# ...
hidden1_outputs = tf.get_default_graph().get_tensor_by_name('hidden1/add:0')
global_step = tf.Variable(0, name='global_step', trainable=False)
train_op = optimizer.minimize(loss, global_step=global_step)

However, the “Variables” referred to in the TensorFlow docs are not simply memory locations holding values (as in a programming language). In a TF neural network, many tensors are not “Variables”. For example, hidden1_outputs in the snippet above is not a variable. To get all variables in the graph:

#for the MNIST network in tutorial: 
In [2]: lv = tf.get_collection(tf.GraphKeys.VARIABLES)
In [4]: [v.name for v in lv]
Out[4]: 
[u'hidden1/weights:0',
 u'hidden1/biases:0',
 u'hidden2/weights:0',
 u'hidden2/biases:0',
 u'softmax_linear/weights:0',
 u'softmax_linear/biases:0',
 u'global_step:0']

In [13]: lvt = tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES)
In [14]: [v.name for v in lvt]
Out[14]: 
[u'hidden1/weights:0',
 u'hidden1/biases:0',
 u'hidden2/weights:0',
 u'hidden2/biases:0',
 u'softmax_linear/weights:0',
 u'softmax_linear/biases:0']

  • The terms Tensor and Variable are used differently in the Python and C++ APIs.

https://stackoverflow.com/questions/40866675/implementation-difference-between-tensorflow-variable-and-tensorflow-tensor

  • Why use tf.constant()?

For efficiency and for readability (of the graph).
http://stackoverflow.com/questions/39512276/tensorflow-simple-operations-tensors-vs-python-variables

The shape of tensors

The first dimension of the input tensor and output tensor is the batch size.
Cf. https://stackoverflow.com/questions/39090222/tensorflow-single-value-vs-batch-tensors
In the MNIST example, the input tensor is of shape (100, 784) where 100 is the batch size. The output tensor is of shape (100,) since the target (\(y\)) is just one number indicating the class (0-9). Since input data is fed one batch at a time in each iteration, it’s typical to apply a reduction when computing the loss (tf.reduce_mean, as shown below). The first dimension of the hidden-layer tensors also matches the batch size (see the example below).

#examples/tutorials/mnist/fully_connected_feed.py
def placeholder_inputs(batch_size):
  """Generate placeholder variables to represent the input tensors.

  These placeholders are used as inputs by the rest of the model building
  code and will be fed from the downloaded data in the .run() loop, below.

  Args:
    batch_size: The batch size will be baked into both placeholders.

  Returns:
    images_placeholder: Images placeholder.
    labels_placeholder: Labels placeholder.
  """
  # Note that the shapes of the placeholders match the shapes of the full
  # image and label tensors, except the first dimension is now batch_size
  # rather than the full size of the train or test data sets.
  images_placeholder = tf.placeholder(tf.float32, shape=(batch_size,
                                                         mnist.IMAGE_PIXELS))
  labels_placeholder = tf.placeholder(tf.int32, shape=(batch_size))
  return images_placeholder, labels_placeholder

feed_dict = fill_feed_dict(data_set,images_placeholder,labels_placeholder)

In [1]: feed_dict
Out[1]: 
{<tf.Tensor 'Placeholder:0' shape=(100, 784) dtype=float32>: array([[ 0.,  0.,  0., ...,  0.,  0.,  0.],
        [ 0.,  0.,  0., ...,  0.,  0.,  0.],
        [ 0.,  0.,  0., ...,  0.,  0.,  0.],
        ..., 
        [ 0.,  0.,  0., ...,  0.,  0.,  0.],
        [ 0.,  0.,  0., ...,  0.,  0.,  0.],
        [ 0.,  0.,  0., ...,  0.,  0.,  0.]], dtype=float32),
 <tf.Tensor 'Placeholder_1:0' shape=(100,) dtype=int32>: array([5, 6, 5, 1, 3, 1, 2, 0, 9, 3, 5, 1, 9, 2, 2, 3, 6, 5, 4, 1, 6, 4, 9,
        9, 0, 0, 2, 8, 9, 2, 9, 9, 5, 9, 9, 4, 3, 7, 8, 5, 5, 1, 8, 5, 0, 3,
        8, 8, 1, 9, 3, 5, 0, 3, 2, 5, 6, 3, 6, 5, 7, 8, 7, 0, 8, 1, 6, 3, 3,
        4, 0, 8, 7, 7, 7, 5, 7, 6, 0, 5, 7, 5, 1, 3, 6, 0, 1, 1, 7, 7, 5, 5,
        1, 0, 3, 0, 9, 5, 0, 4], dtype=uint8)}

def loss(logits, labels):
  """Calculates the loss from the logits and the labels.

  Args:
    logits: Logits tensor, float - [batch_size, NUM_CLASSES].
    labels: Labels tensor, int32 - [batch_size].

  Returns:
    loss: Loss tensor of type float.
  """
  labels = tf.to_int64(labels)
  # sgu: cross_entropy.get_shape(): TensorShape([Dimension(100)]) where 100 is the batch_size 
  cross_entropy = tf.nn.sparse_softmax_cross_entropy_with_logits(
      labels=labels, logits=logits, name='xentropy') 
  return tf.reduce_mean(cross_entropy, name='xentropy_mean')

#sgu: due to the tf.reduce_mean, loss now is just a number(i.e., 0-D tensor)
loss = mnist.loss(logits, labels_placeholder) #loss.get_shape(): TensorShape([])

# Get the outputs before the ReLU.
# hidden1_outputs.get_shape(): TensorShape([Dimension(100), Dimension(128)])
# where 100 is the batch size and 128 is the dimension in the hidden1 layer,
# i.e., 128 features are extracted in hidden1
hidden1_outputs = tf.get_default_graph().get_tensor_by_name('hidden1/add:0')

In some cases, the input tensor might be of rank 1 (train_inputs) while the label tensor (train_labels) is of rank 2. This is not a problem as long as the operations can consume them. In the sample code below, tf.nn.embedding_lookup and tf.nn.nce_loss require arguments of those shapes.

#example/tutorials/word2vec/word2vec_basic.py. 
    #sgu train_inputs.get_shape(): TensorShape([Dimension(32)]) where 32 is the batch_size 
    train_inputs = tf.placeholder(tf.int32, shape=[batch_size])
    train_labels = tf.placeholder(tf.int32, shape=[batch_size, 1]) #nce_loss requires train_lables to be rank 2 
    embeddings = tf.Variable(
               tf.random_uniform([vocabulary_size, embedding_size], -1.0, 1.0))

    #sgu: embeddings.get_shape(): TensorShape([Dimension(50000), Dimension(128)]) where 50000 is the vocab_size and 128 is the embedding_size.

    #sgu:embed.get_shape(): TensorShape([Dimension(32), Dimension(128)]) 
    #sgu: where 32 is the batch_size and 128 is the dimension size
    embed = tf.nn.embedding_lookup(embeddings, train_inputs) 

    # Construct the variables for the NCE loss
    nce_weights = tf.Variable(
        tf.truncated_normal([vocabulary_size, embedding_size],
                            stddev=1.0 / math.sqrt(embedding_size)))
    nce_biases = tf.Variable(tf.zeros([vocabulary_size]))

  #sgu: loss is a scalar, ie, a tensor of rank 0
  #loss.get_shape: TensorShape([])
  loss = tf.reduce_mean(
      tf.nn.nce_loss(weights=nce_weights,
                     biases=nce_biases,
                     labels=train_labels,
                     inputs=embed,
                     num_sampled=num_sampled,
                     num_classes=vocabulary_size))

Add operation node to the graph

The TensorFlow API is designed so that library routines creating new operation nodes always attach them to the default graph. In fact, even value + 1 or tf.reduce_mean(value) adds new nodes to the graph (as demonstrated below). When debugging interactively, be careful not to create unintended new nodes in the graph.

In [1]: import tensorflow as tf
In [2]: graph=tf.get_default_graph()
In [3]: graph.as_graph_def()
Out[3]: 
versions {
  producer: 17
}
In [4]: value = tf.constant(1)
In [5]: graph.as_graph_def()
Out[5]: 
node {
  name: "Const"
  op: "Const"
  attr {
    key: "dtype"
    value {
      type: DT_INT32
    }
  }
  attr {
    key: "value"
    value {
      tensor {
        dtype: DT_INT32
        tensor_shape {
        }
        int_val: 1
      }
    }
  }
}
versions {
  producer: 17
}
In [6]: value2=value+1
In [7]: graph.as_graph_def()
Out[7]: 
node {
  name: "Const"
  op: "Const"
  attr {
    key: "dtype"
    value {
      type: DT_INT32
    }
  }
  attr {
    key: "value"
    value {
      tensor {
        dtype: DT_INT32
        tensor_shape {
        }
        int_val: 1
      }
    }
  }
}
node {
  name: "add/y"
  op: "Const"
  attr {
    key: "dtype"
    value {
      type: DT_INT32
    }
  }
  attr {
    key: "value"
    value {
      tensor {
        dtype: DT_INT32
        tensor_shape {
        }
        int_val: 1
      }
    }
  }
}
node {
  name: "add"
  op: "Add"
  input: "Const"
  input: "add/y"
  attr {
    key: "T"
    value {
      type: DT_INT32
    }
  }
}
versions {
  producer: 17
}
In [8]: tf.reduce_mean(value)
Out[8]: <tf.Tensor 'Mean:0' shape=() dtype=int32>
In [9]: graph.as_graph_def()
Out[9]: 
node {
  name: "Const"
  op: "Const"
  attr {
    key: "dtype"
    value {
      type: DT_INT32
    }
  }
  attr {
    key: "value"
    value {
      tensor {
        dtype: DT_INT32
        tensor_shape {
        }
        int_val: 1
      }
    }
  }
}
node {
  name: "add/y"
  op: "Const"
  attr {
    key: "dtype"
    value {
      type: DT_INT32
    }
  }
  attr {
    key: "value"
    value {
      tensor {
        dtype: DT_INT32
        tensor_shape {
        }
        int_val: 1
      }
    }
  }
}
node {
  name: "add"
  op: "Add"
  input: "Const"
  input: "add/y"
  attr {
    key: "T"
    value {
      type: DT_INT32
    }
  }
}
node {
  name: "Const_1"
  op: "Const"
  attr {
    key: "dtype"
    value {
      type: DT_INT32
    }
  }
  attr {
    key: "value"
    value {
      tensor {
        dtype: DT_INT32
        tensor_shape {
          dim {
          }
        }
      }
    }
  }
}
node {
  name: "Mean"
  op: "Mean"
  input: "Const"
  input: "Const_1"
  attr {
    key: "T"
    value {
      type: DT_INT32
    }
  }
  attr {
    key: "Tidx"
    value {
      type: DT_INT32
    }
  }
  attr {
    key: "keep_dims"
    value {
      b: false
    }
  }
}
versions {
  producer: 17
}

Inspect the Graph

# Q: what happen when name_scope is used together with variable_scope?
with tf.name_scope("ns"):
    with tf.variable_scope("vs"):
        v1 = tf.get_variable("v1",[1.0]) 
        v2 = tf.Variable([2.],name="v2")
        v3 = v1+v2
v1.name  #vs/v1:0
v2.name  #ns/vs/v2:0
v3.name  #ns/vs/add:0 

#list all the node
l = [n for n in tf.get_default_graph().as_graph_def().node] 
[(ll.name,ll.op) for  ll in l]

In [12]: g  = tf.get_default_graph()
In [13]: op = g.get_operation_by_name("ns/vs/add")
In [14]: op.node_def
Out[14]: 
name: "ns/vs/add"
op: "Add"
input: "vs/v1/read"
input: "ns/vs/v2/read"
attr {
  key: "T"
  value {
    type: DT_FLOAT
  }
}
#get the output tensor of an op 
t = g.get_tensor_by_name("ns/vs/add:0") 
assert t==v3


Sample code to visualize embedding vectors

Here is sample code to visualize MNIST hidden vectors in TensorBoard:
https://github.com/tensorflow/tensorflow/issues/6322
Here is sample code to visualize word2vec in TensorBoard:
https://github.com/shiyuangu/tensorflow/blob/master/tensorflow/examples/tutorials/word2vec/word2vec_basic.py

ShellScript



Flow Control

for i in {1..12};do
   scp -i ~/tmp/tmp_pem hadoop@ec2host.com://path/to/file_${i}.Rda .
done
# or in one line
for i in {1..12}; do scp -i /tmp/tmp_pem hadoop@ec2host.com://path/to/file1_${i}.Rda . ;scp -i /tmp/tmp_pem hadoop@ec2host.com://path/to/file1_${i}_train.Rda . ; scp -i /tmp/tmp_pem hadoop@ec2host.com://path/to/file2_${i}.Rda . ; done

Piping

  • Piping with stdin redirection. Without the stdin redirection, emacs complains that stdin is not a tty. The “$@” is a special shell variable that expands to all arguments passed to the shell.

    ls prod_JP*|xargs zsh -c 'emacs "$@"</dev/tty'
    

Utilities

find (Cf. 1):

  • -exec

    #The following commands do the same thing: 
    find / -name core -exec /bin/rm -f '{}' \;
    find / -name core -exec /bin/rm -f '{}' +
    find / -name core | xargs /bin/rm -f
    
    • The ‑exec is followed by some shell command line, ended with a semicolon (“;”). (The semicolon must be quoted from the shell, so find can see it!) Within that command line, the word “{}” will expand out to the name of the found file. An alternate form of ‑exec ends with a plus sign, not a semicolon. This form collects the filenames into groups or sets, and runs the command once per set. (This is exactly what xargs does, to prevent argument lists from becoming too long for the system to handle.)1
    • To handle exotic filenames (which contain spaces or newlines), use the form “find .. -print0 | xargs -0 …”
  • Gotcha: When no action is given, “-print” is assumed and grouping is performed. This can mess up the precedence of boolean operations and give surprising results. See 1.
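The difference between the “;” and “+” terminators can be seen with echo. This is a toy sketch using a made-up temporary directory:

```shell
# Made-up demo directory. With "+", find batches all matched names into a
# single echo invocation, so everything lands on one line; with ";", echo
# runs once per file.
mkdir -p /tmp/find_exec_demo
touch /tmp/find_exec_demo/a.txt /tmp/find_exec_demo/b.txt
find /tmp/find_exec_demo -name '*.txt' -exec echo {} +
```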

tar

  • Archive and gzip a whole directory:
tar czvf ShiyuanGu.tar.gz ShiyuanGu
  • Extract and uncompress:
tar xzvf ShiyuanGu.tar.gz ShiyuanGu

rsync

  • Rsync between remote and local through ssh
rsync -avzhe ssh --progress shiyuang@shiyuang.desktop.amazon.com://path/to/dir .
rsync -avzhe "ssh -i /full/path/to/pem" --progress localdir user@ec2host.amazonaws.com://mnt/

-a: archive
-v: verbose
-z: compress file data during transfer
-h: human readable info
-e ssh: connect through ssh (use the full path in -i)

sed

Features

  • Makes only one pass over the input and is hence efficient.
  • Ability to filter text in a pipeline

Synopsis:

sed [OPTION]... {script} [input-file]
  • The script is actually the first non-option parameter, which sed treats as a script rather than an input file if (and only if) no other option specifies a script to be executed, that is, if neither the -e nor the -f option is given. `man sed` calls it {script-only-if-no-other-script}.

  • When multiple -e and -f options are given, the script is understood to be their in-order concatenation. Cf. here

  • sed maintains two spaces, the active pattern space and the hold space. Both are initially empty. sed processes the input line by line. The input can be a file, a list of files, or stdin, denoted by the character -. sed starts each cycle by reading in a line, removing any trailing newline, and placing the line in the pattern space. Then the script commands are executed. When all commands in the script are finished, the contents of the pattern space are printed to the output stream, adding back the trailing newline if it was removed (well, some nasty little things happen here; cf. the sed manual footnote). The pattern space starts fresh before each cycle (except for special commands, like D). The hold space, on the other hand, keeps its data between cycles (Cf. How sed works: Execution Cycle).

  • When the input is a list of files, by default sed considers the files a single continuous stream. GNU sed has an extension option `-s` that lets the user treat them as separate files, with line numbers relative to the start of each file (the catch with `-s` is that a range address cannot span several files). For example, suppose sed-test-1.in and sed-test-2.in each have two lines of “abcde”. The first command below only replaces in the first line of the first input file, sed-test-1.in, while the second command replaces in the first line of each input file as well as of the input from stdin.

    echo "abcd"|sed -e "1 y/a/b/" sed-test-1.in sed-test-2.in -  
    echo "abcd"|sed -s -e "1 y/a/b/" sed-test-1.in sed-test-2.in -
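A minimal demonstration of the two spaces (GNU sed syntax): x exchanges the pattern and hold spaces, and G appends the hold space to the pattern space; the net effect here is to swap two lines.

```shell
# Line 1: x moves "first" into the hold space; d starts the next cycle.
# Line 2: G appends a newline plus the held "first" after "second"; p prints.
printf 'first\nsecond\n' | sed -n '1{x;d}; 2{G;p}'
# prints "second" then "first"
```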
    

Examples

  • Centering Lines with sed.

    In the GNU sed manual, this script (https://www.gnu.org/software/sed/manual/sed.html#Centering-lines) is used to demonstrate how to center each line with sed. But this script is wrong and dangerous! If a line is longer than 80 characters, the line will be silently truncated.

        #!/usr/bin/sed -f
        # Generate 80 spaces in the hold space
    # Note that hold space keeps its data between cycles,but we only need to generate the spaces once. 
    # The 1 is the address matches only the first line. 
        1 {
    x   #swap pattern space with hold space; since s operates on pattern space only. 
    s/^$/          /   #Replace empty line with 10 white spaces;  
    s/^.*$/&&&&&&&&/   # ^.*$ matches the previous inserted 10 white spaces; The & means the whole match. So we have 80 white spaces;  
    x   # swap pattern space with hold space. Now we have 80 spaces in the hold space. In the pattern space, we have the current line. 
        }
    
    # del leading and trailing spaces
    # The y command transliterate characters. 
    # Note that the y command, different from s,  allows no comments on the same line  
        y/\t/ /
        s/^ *// # delete all leading spaces  
        s/ *$// # delete all trailing spaces
    
    #The G command append a newline to the contents of the pattern space, 
    #and then append the contents of the hold space to that of the pattern space. 
        G      
    
        # keep first 81 chars (80 + a newline)
        # Note that . in sed matches newline. This is different from elisp or python
        # CAUTION: This command is dangerous, if the line is longer than 80 characters, we will lose anything after the 80th characters!
        s/^\(.\{81\}\).*$/\1/
    
        # \2 matches half of the spaces, which are moved to the beginning
        # The expression \(.*\)\2 captures necessary amount of spaces to fill in front, and divide the spaces equally into two parts. 
    # Grouping and subexpression matching allows us to do arithmetics. 
        s/^\(.*\)\n\(.*\)\2/\2\1/
    
  • Replace leading spaces with tabs.

    Sometimes we need to change indentation style, for example, replace every run of four leading spaces with one tab. The following one-liner does the work. Note the use of a label, the conditional jump command t, and grouping. The command t jumps to the label only when the preceding substitution succeeded; together they form an implicit loop where each iteration replaces four spaces with a tab. For the meanings of `{}` and `+`, refer to the `find` utility here.

    find . -iname "*.ion" -exec sed -i ':l s/^\(\t*\) \{4\}/\1\t/; t l' {} +
    

Gotcha:

  • The s command and the g flags.

    The command ‘s/Hello/HelloWorld/’ only replace the first match of each line. To replace all matches, use the g flag, ‘s/Hello/HelloWorld/g’.

  • Word boundary

    Sometimes we think in terms of words and want to match the whole word. In this case, use the word boundary “\b”.
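For example (GNU sed), the boundary keeps “catalog” from matching:

```shell
# \b anchors the match at word boundaries, so only the standalone word
# "cat" is replaced, not the "cat" inside "catalog".
echo "cat catalog" | sed 's/\bcat\b/dog/g'   # prints: dog catalog
```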

cut

  • Delete the third field, with a comma as the field separator

    cut --complement -d, -f3 adult.data > income_train_data.csv
    

column

  • The following one-liner is taken from this post. It displays a CSV (comma-separated values) file in a nice format from the command line

    cat file.csv | sed -e 's/,,/, ,/g' | column -s, -t | less -#5 -N -S
    

    The sed command handles the corner case of an empty field; the ‘g’ means replacing all matches, not just the first; the `-s` in the `column` command specifies the delimiter; `-t` creates a table for pretty printing; `-#` in the `less` command specifies the number of positions to scroll with the right/left arrow keys; `-N` displays line numbers; and `-S` causes long lines to be chopped rather than folded (that is, `M-x toggle-truncate-lines` in Emacs). In the case of a tab-separated file with missing fields, we can replace tabs with commas first, and then fill in the missing fields with a space,

    cat file.txt|sed -e 's/\t/,/g;s/,,/, ,/g'|column -s, -t|less -#5 -S
    

    The sed command ‘s/\t/,/g’ replaces tabs with commas and ‘s/,,/, ,/g’ fills in the missing fields with a space. Note that without the fill-in step, the alignment is wrong.

awk

Randomly sample rows (e.g. 1 in 1000).

awk 'BEGIN {srand()} !/^$/ { if (rand() <= .001) print $0}' infile.txt> outfile.txt 
awk 'BEGIN {srand()} !/^$/ { if (rand() <= .01 || FNR==1) print $0}' infile.txt> outfile.txt # 1 in 100, and include the header.

Extract a field and do a numeric comparison

## The following awk command extracts the lines with pval less than 0.05 from a csv file
## Sample line: 20150204,21,R,cp, All, 1, NTrig, pval,.04
## -F specifies the field separator; $0 is the whole line; $8 and $9 are the eighth and ninth fields.
awk -F, '$8==" pval" && $9<0.05 {print $0}' inputfile.csv

Concatenate files with same header

## Concatenate csv files, stripping off the header of every file except the first.
## FNR==1 && NR!=1: match the first line of a file except when it's also the first line across all files (NR==1)
## getline: skip the line. 
## 1{print}: print everything except the lines previously skipped
> awk '
FNR==1 && NR!=1 {getline;} 
1{print}
' input-blob*.csv > input-all.csv

## A more sophisticated example: the header spreads multiple line. 
> awk '
FNR==1 && NR!=1 {while (/^<header>/) getline;}  #skip the lines starting with the <header> tag
1{print}
' input-blob*.csv > input-all.csv

Emacs-ipython-notebook



This is a note on my workflow using IPython notebooks through the Emacs package ein (EIN).

  1. Set up ein (Cf. myein.el or /path/to/ein/lisp/zeroein.el)
  2. In a terminal, `cd /path/to/notebook` and run `ipython notebook`. This starts an IPython notebook server.
  3. `M-x ein:notebooklist-open` and select the notebook file.
    We can run the notebook cell by cell in an interactive way.
  4. Connect to a notebook:
    EIN allows us to create/open a Python source file in another buffer and connect it to the notebook kernel. This allows sharing the kernel and auto-completion (powered by jedi).

    1. Switch to the buffer with the Python source file (.py).
    2. `M-x ein:connect-to-notebook-command`. This prompts us to select one of the already-open notebook buffers and connects the current buffer to it. With a prefix argument, the command also allows selecting notebooks that are not open yet.
    3. Step (2) enables the minor mode ein:connect-mode, which provides convenient keymaps for auto-completion, etc.
      • ‘.’ runs the ein:jedi-dot-complete command
      • C-c C-. or M-. runs ein:pytools-jump-to-source-command
      • C-c C-, or M-, runs ein:pytools-jump-back-command: go back to the point where ein:pytools-jump-to-source-command was last executed. When the prefix argument C-u is given, open the last point in the other window.
      • C-u C-c C-f runs the command ein:pytools-request-tooltip-or-help, which shows detailed info about the symbol at point in the pager buffer. Without the prefix, the info is shown in a popup message box. ein:pytools-request-tooltip-or-help calls ein:kernel-construct-help-string, which combines the info from the docstring, init_docstring, and related fields.
      • C-c C-/ runs ein:notebook-scratchsheet-open, which opens a “scratch sheet” (a new one when a prefix argument is given). A scratch sheet is almost identical to a worksheet, except that EIN will not save the buffer. Use it like a normal IPython console. Note that you can always copy cells into the normal worksheet to save results.

Source Code Digging: Emacs-Org-Babel



org-babel-execute-src-block:

When we press `C-c C-c` inside a source code block, the function `org-ctrl-c-ctrl-c` is invoked. The function is context-aware and does different things depending on where the cursor is. If the cursor is inside a source code block, org-babel-execute-src-block() is invoked. That function extracts the source code, constructs the info necessary for execution, and then calls a wrapper function (like org-babel-execute:python(…)). This post is based on a debugger trace of the following toy Python example. By digging into this piece of elisp code, we can learn how to call processes from Emacs and retrieve the result.

##+BEGIN_SRC python :results output :session foo
print "hello"
23
print "bye"
##+END_SRC

The backtrace is as follow

Debugger entered--beginning evaluation of function call form:
 * (if info (copy-tree info) (org-babel-get-src-block-info))
 * (let* ((org-babel-current-src-block-location ....)
 * org-babel-execute-src-block(nil)
 (progn (org-babel-eval-wipe-error-buffer) (org-babel-execute-src-block current-prefix-arg) t)
 (if (or (org-babel-where-is-src-block-head) (org-babel-get-inline-src-block-matches)) (progn (org-babel-eval-wipe-error-buffer) (org-babel-execute-src-block current-prefix-arg) t) nil)
 org-babel-execute-src-block-maybe()
 (or (org-babel-execute-src-block-maybe) (org-babel-lob-execute-maybe))
 org-babel-execute-maybe()
 (if org-babel-no-eval-on-ctrl-c-ctrl-c nil (org-babel-execute-maybe))
 org-babel-execute-safely-maybe()
 run-hook-with-args-until-success(org-babel-execute-safely-maybe)
 (cond ....)
 org-ctrl-c-ctrl-c(nil)
 call-interactively(org-ctrl-c-ctrl-c nil nil)

One essential piece of org-babel-execute-src-block is the following expression, which extracts the src code and constructs the info from the context. Note that org-babel-execute-src-block is initially passed nil as its argument when it is called by org-ctrl-c-ctrl-c. After org-babel-get-src-block-info is called, org-babel-execute-src-block has the necessary info for src code execution.

(if info (copy-tree info) (org-babel-get-src-block-info))

After this expression, info becomes something like this:

("python" "print \"hello\"
23
print \"bye\"" ((:colname-names) (:rowname-names) (:result-params "output" "replace") (:result-type . output) (:comments . "") (:shebang . "") (:cache . "no") (:padline . "") (:noweb . "no") (:tangle . "no") (:exports . "code") (:results . "output replace") (:session . "foo") (:hlines . "no")) "" nil 0 #<marker at 1 in test-org-python-fix.org>)

In particular, the header argument :result-params holds info about how the result of execution should be collected and handled. This info is extracted and eventually used as follows:

(if (member "none" result-params)
          (progn
            (funcall cmd body params)
            (message "result silenced")
            (setq result nil))
          (setq result
              (let ((result (funcall cmd body params)))
                            (if (and (eq (cdr (assoc :result-type params))
                                         'value)
                                     (or (member "vector" result-params)
                                         (member "table" result-params))
                                     (not (listp result)))
                                (list (list result)) result)))
          ;;; post processing: handle special types of result like vector, table, etc.

In the snippet above, the variable cmd holds the function to call, which is usually a wrapper like “org-babel-execute:python”. The variable body holds the src code block as a string, something like:

"print \"hello\" 
23
print \"bye\""

The variable params holds the necessary info for the wrapper to process, which looks like:

((:comments . "")
 (:shebang . "")
 (:cache . "no")
 (:padline . "")
 (:noweb . "no")
 (:tangle . "no")
 (:exports . "code")
 (:results . "replace output")
 (:hlines . "no")
 (:session . "foo")
 (:result-type . output)
 (:result-params "output" "replace")
 (:rowname-names)
 (:colname-names))

The logic for results of evaluation (cf. http://orgmode.org/manual/Results-of-evaluation.html#Results-of-evaluation) is handled in the wrapper org-babel-execute:python, where body is the source code and params is as shown above.

(defun org-babel-execute:python (body params)
  "Execute a block of Python code with Babel.
This function is called by `org-babel-execute-src-block'."
  (let* ((session (org-babel-python-initiate-session
           (cdr (assoc :session params))))
         (result-params (cdr (assoc :result-params params)))
         (result-type (cdr (assoc :result-type params)))
     (return-val (when (and (eq result-type 'value) (not session))
               (cdr (assoc :return params))))
     (preamble (cdr (assoc :preamble params)))
     (org-babel-python-command
      (or (cdr (assoc :python params)) org-babel-python-command))
      (full-body
      (org-babel-expand-body:generic
       (concat body (if return-val (format "\nreturn %s" return-val) ""))
       params (org-babel-variable-assignments:python params)))
      (result (org-babel-python-evaluate
          session full-body result-type result-params preamble)))
(org-babel-reassemble-table
     result
     (org-babel-pick-name (cdr (assoc :colname-names params))
              (cdr (assoc :colnames params)))
     (org-babel-pick-name (cdr (assoc :rowname-names params))
              (cdr (assoc :rownames params))))))

A few noteworthy points:

  • org-babel-python-command (usually set in .emacs, something like “ipython --no-banner --classic --no-confirm-exit”) can be overridden by the header argument :python. Also, org-babel-python-command is used for both session and non-session evaluation.
    (org-babel-python-command
      (or (cdr (assoc :python params)) org-babel-python-command))
    
  • The src code is expanded before evaluation
    (full-body
       (org-babel-expand-body:generic
        (concat body (if return-val (format "\nreturn %s" return-val) ""))
        params (org-babel-variable-assignments:python params)))
    

The src code body is next passed to the function org-babel-python-evaluate. Now we see that the concept of ‘session/non-session’ in the org-babel manual (http://orgmode.org/manual/Results-of-evaluation.html#Results-of-evaluation) actually determines whether the code is passed to an external process or to an interpreter running as an interactive Emacs inferior process (the variable “session” in the src code is actually a string holding the buffer name, like “* foo *”).

(defun org-babel-python-evaluate
  (session body &optional result-type result-params preamble)
  "Evaluate BODY as Python code."
  (if session
      (org-babel-python-evaluate-session
       session body result-type result-params)
    (org-babel-python-evaluate-external-process
     body result-type result-params preamble)))

Let’s look at what happens in the session case. Babel evaluates the src code body in the buffer named session using comint.

(defun org-babel-python-evaluate-session
    (session body &optional result-type result-params)
  "Pass BODY to the Python process in SESSION.
If RESULT-TYPE equals 'output then return standard output as a
string.  If RESULT-TYPE equals 'value then return the value of the
last statement in BODY, as elisp."
  (let* ((send-wait (lambda () (comint-send-input nil t) (sleep-for 0 5)))
     (dump-last-value
      (lambda
        (tmp-file pp)
        (mapc
         (lambda (statement) (insert statement) (funcall send-wait))
         (if pp
         (list
          "import pprint"
          (format "open('%s', 'w').write(pprint.pformat(_))"
              (org-babel-process-file-name tmp-file 'noquote)))
           (list (format "open('%s', 'w').write(str(_))"
                 (org-babel-process-file-name tmp-file
                                                          'noquote)))))))
     (input-body (lambda (body)
               (mapc (lambda (line) (insert line) (funcall send-wait))
                 (split-string body "[\r\n]"))
               (funcall send-wait)))
         (results
          (case result-type
            (output
             (mapconcat
              #'org-babel-trim
              (butlast
               (org-babel-comint-with-output
                   (session org-babel-python-eoe-indicator t body)
                 (funcall input-body body)
                 (funcall send-wait) (funcall send-wait)
                 (insert org-babel-python-eoe-indicator)
                 (funcall send-wait))
               2) "\n"))
            (value
             (let ((tmp-file (org-babel-temp-file "python-")))
               (org-babel-comint-with-output
                   (session org-babel-python-eoe-indicator nil body)
                 (let ((comint-process-echoes nil))
                   (funcall input-body body)
                   (funcall dump-last-value tmp-file
                            (member "pp" result-params))
                   (funcall send-wait) (funcall send-wait)
                   (insert org-babel-python-eoe-indicator)
                   (funcall send-wait)))
               (org-babel-eval-read-file tmp-file))))))
    (unless (string= (substring org-babel-python-eoe-indicator 1 -1) results)
      (org-babel-result-cond result-params
    results
        (org-babel-python-table-or-string results)))))

The function above takes the src code body, splits it into lines, and then inserts them into the interpreter buffer, imitating a human typing the src code into the buffer.

(input-body (lambda (body)
               (mapc (lambda (line) (insert line) (funcall send-wait))
                 (split-string body "[\r\n]"))
               (funcall send-wait)))

When :result-type is ‘output, the returned result is based on processing the interpreter buffer (removing echoes, etc.). When :result-type is ‘value, only the value of the last executed statement in the interpreter session is retrieved and all screen output is ignored. The underscore _ in Python holds the result of the last executed statement in an interactive interpreter session; with "pp" in result-params, the value is pretty-printed via pprint.

;;; The underscore _ in python holds the result of the last executed statement in an interactive interpreter session.
(dump-last-value
      (lambda
        (tmp-file pp)
        (mapc
         (lambda (statement) (insert statement) (funcall send-wait))
         (if pp
         (list
          "import pprint"
          (format "open('%s', 'w').write(pprint.pformat(_))"
              (org-babel-process-file-name tmp-file 'noquote)))
           (list (format "open('%s', 'w').write(str(_))"
                 (org-babel-process-file-name tmp-file
                                                          'noquote)))))))

For non-session evaluation (that is, :session none), org-babel-python-evaluate-external-process is invoked, which calls org-babel-eval and then org-babel--shell-command-on-region. The real work is done by the lisp function process-file.

(defun org-babel-python-evaluate-external-process
  (body &optional result-type result-params preamble)
  "Evaluate BODY in external python process.
If RESULT-TYPE equals 'output then return standard output as a
string.  If RESULT-TYPE equals 'value then return the value of the
last statement in BODY, as elisp."
  (let ((raw
         (case result-type
           (output (org-babel-eval org-babel-python-command
                                   (concat (if preamble (concat preamble "\n"))
                                           body)))
           (value (let ((tmp-file (org-babel-temp-file "python-")))
                    (org-babel-eval
                     org-babel-python-command
                     (concat
                      (if preamble (concat preamble "\n") "")
                      (format
                       (if (member "pp" result-params)
                           org-babel-python-pp-wrapper-method
                         org-babel-python-wrapper-method)
                       (mapconcat
                        (lambda (line) (format "\t%s" line))
                        (split-string
                         (org-remove-indentation
                          (org-babel-trim body))
                         "[\r\n]") "\n")
                       (org-babel-process-file-name tmp-file 'noquote))))
                    (org-babel-eval-read-file tmp-file))))))
    (org-babel-result-cond result-params
      raw
      (org-babel-python-table-or-string (org-babel-trim raw)))))

(defun org-babel-eval (cmd body)
  "Run CMD on BODY.
If CMD succeeds then return its results, otherwise display
STDERR with `org-babel-eval-error-notify'."
  (let ((err-buff (get-buffer-create " *Org-Babel Error*")) exit-code)
    (with-current-buffer err-buff (erase-buffer))
    (with-temp-buffer
      (insert body)
      (setq exit-code
        (org-babel--shell-command-on-region
         (point-min) (point-max) cmd err-buff))
      (if (or (not (numberp exit-code)) (> exit-code 0))
      (progn
        (with-current-buffer err-buff
          (org-babel-eval-error-notify exit-code (buffer-string)))
        nil)
    (buffer-string)))))

(defun org-babel--shell-command-on-region (start end command error-buffer)
  "Execute COMMAND in an inferior shell with region as input.

Stripped down version of shell-command-on-region for internal use
in Babel only.  This lets us work around errors in the original
function in various versions of Emacs.
"
  (let ((input-file (org-babel-temp-file "ob-input-"))
    (error-file (if error-buffer (org-babel-temp-file "ob-error-") nil))
    ;; Unfortunately, `executable-find' does not support file name
    ;; handlers.  Therefore, we could use it in the local case
    ;; only.
    (shell-file-name
     (cond ((and (not (file-remote-p default-directory))
             (executable-find shell-file-name))
        shell-file-name)
           ((file-executable-p
         (concat (file-remote-p default-directory) shell-file-name))
        shell-file-name)
           ("/bin/sh")))
    exit-status)
    ;; There is an error in `process-file' when `error-file' exists.
    ;; This is fixed in Emacs trunk as of 2012-12-21; let's use this
    ;; workaround for now.
    (unless (file-remote-p default-directory)
      (delete-file error-file))
    ;; we always call this with 'replace, remove conditional
    ;; Replace specified region with output from command.
    (let ((swap (< start end)))
      (goto-char start)
      (push-mark (point) 'nomsg)
      (write-region start end input-file)
      (delete-region start end)
      (setq exit-status
        (process-file shell-file-name input-file
              (if error-file
                  (list t error-file)
                t)
              nil shell-command-switch command))
      (when swap (exchange-point-and-mark)))

    (when (and input-file (file-exists-p input-file)
           ;; bind org-babel--debug-input around the call to keep
           ;; the temporary input files available for inspection
           (not (when (boundp 'org-babel--debug-input)
              org-babel--debug-input)))
      (delete-file input-file))

    (when (and error-file (file-exists-p error-file))
      (if (< 0 (nth 7 (file-attributes error-file)))
      (with-current-buffer (get-buffer-create error-buffer)
        (let ((pos-from-end (- (point-max) (point))))
          (or (bobp)
          (insert "\f\n"))
          ;; Do no formatting while reading error file,
          ;; because that can run a shell command, and we
          ;; don't want that to cause an infinite recursion.
          (format-insert-file error-file nil)
          ;; Put point after the inserted errors.
          (goto-char (- (point-max) pos-from-end)))
        (current-buffer)))
      (delete-file error-file))
    exit-status))

The function org-babel--shell-command-on-region constructs the arguments to call process-file, which looks like the following in the case of python:

process-file("/bin/bash"
             "/var/folders/8b/fw04kzxx5wd83bvpd3j2chds1dvh0k/T/babel-17601AAS/ob-input-176011aW"
             (t "/var/folders/8b/fw04kzxx5wd83bvpd3j2chds1dvh0k/T/babel-17601AAS/ob-error-17601Clc")
             nil "-c" "ipython --no-banner --classic --no-confirm-exit")

The first argument is shell-file-name; the second argument is the temp file that holds the python src code; the third argument says to separate the standard output stream from the standard error stream, with errors going to a temp file. The fourth argument nil means don’t redisplay the buffer as output is inserted (the output still seems to be inserted in the buffer eventually). The rest of the arguments are passed to “/bin/bash” verbatim: “/bin/bash -c ipython” means to use ipython to execute the input file (see `man bash` for the usage of the option -c, and `ipython --help` for the ipython options).

Java vs. C++



This post is a summary of the syntactic and semantic differences between C++ and Java.

Exception:

  • In Java, only objects of classes derived from Throwable can be thrown; in C++, any object can be thrown.
  • Java has two types of exceptions: checked and unchecked; all exceptions in C++ are unchecked. C++ does have exception specifications, but they are not part of the function signature and are not checked at compile time. At runtime, if an exception specification is not obeyed, std::unexpected is called (Cf. cppreference.com Exception Specification).
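The checked/unchecked distinction above can be sketched with a minimal, runnable Java example (class and method names here are made up for illustration): a method that declares a checked exception forces every caller to catch it or re-declare it, while an unchecked RuntimeException imposes no such obligation.

```java
// Checked vs. unchecked exceptions in Java (illustrative sketch).
public class ExceptionDemo {
    // Checked: every caller must catch Exception or declare "throws".
    static void mayFail() throws Exception {
        throw new Exception("checked boom");
    }

    // Unchecked: callers are free to ignore RuntimeException subclasses.
    static void mayFailUnchecked() {
        throw new IllegalStateException("unchecked boom");
    }

    public static void main(String[] args) {
        try {
            mayFail();            // won't compile without this try/catch (or a throws clause)
        } catch (Exception e) {
            System.out.println("caught: " + e.getMessage());
        }
        try {
            mayFailUnchecked();   // catching is optional; done here only to keep the demo running
        } catch (RuntimeException e) {
            System.out.println("caught: " + e.getMessage());
        }
    }
}
```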

Enum

  • C++’s enum is just an integer internally (roughly speaking) and can be mixed with integers, and hence is not type safe. C++11 provides a type-safe enum class (cf. wiki-enum-c++).
enum color { red, yellow, green=20, blue };
color col = red;
int n = blue; // n == 21
  • Java’s enum is actually a special compiler-generated class rather than an arithmetic type (cf. Wiki-Enumerated_Type). A classic usage is to use an enum as a factory. For example, the following LineSearchEnum is a factory class for creating various line search methods:
  public enum LineSearchEnum {
    SIMPLE {
       @Override 
       public LineSearch newInstance() {
           return new SimpleLineSearch();
       }
    }, 
    ARMIJO {
      @Override
      public LineSearch newInstance() {
          return new ArmijoLineSearch();
      }
    }, 
    CUBIC {
      @Override
      public LineSearch newInstance() {
          return new CubicLineSearch();
      }
    },
    WOLFE {
        @Override
        public LineSearch newInstance() {
            return new WolfeLineSearch();
        }
    };
    public abstract LineSearch newInstance();
}

Private inheritance

Java doesn’t support private inheritance but C++ does. Private inheritance is one way to implement the logical relation of “is-implemented-in-terms-of”; for example, we can implement a set using std::list. The book Effective C++ gives a nice example: Widget needs some functionality that is already implemented in a class Timer, but Widget is not a timer, and clients of Widget should not be able to invoke anything in Timer directly (such as Timer::onTick()), so public inheritance is not appropriate here. Private inheritance provides one possible solution. Effective C++ also provides an alternative approach using a private nested class and composition. This approach has two benefits: (1) subclasses of Widget have no way to override the virtual function Timer::onTick() (note that with private inheritance, a subclass cannot call onTick() but can still override it); (2) it minimizes compilation dependencies if we use a pointer to WidgetTimer.

class Timer{
 public:
  explicit Timer(int tickFrequency);
  virtual void onTick() const;
};
class Widget: private Timer{
 private:
  virtual void onTick() const;
};
/*Alternative Approach*/
class Widget{
  private: 
     class WidgetTimer:public Timer{
       public: 
          virtual void onTick()const; 
     };
  WidgetTimer timer; //or use pointer WidgetTimer* timer to minimize compilation dependency; 
};

Access control

  • Java provides four levels of access control: no modifier (package-private), public, private and protected, while C++ has only three levels: public, private and protected. C++ also has the concept of a “friend” class: a class can access its friends’ private members.
  • Both C++ and Java have “protected” access control (cpp access control, java access control, why does the protected modifier in Java allow access to other classes in the same package), but their meanings are not the same. In C++, protected members can be accessed only by the class’s friends and its subclasses. In Java, protected members can also be accessed by any class within the same package, and by subclasses of the class even in another package. In both languages a subclass can change the access control when overriding, but Java only allows widening, never narrowing (AccessControl and Inheritance). For example, the following is valid Java; the protected draw() in Base is widened to public in Derived:
    //javac -cp /Users/shiyuang/learningnotes/java/HelloWorld/classes -d $CLASSPATH/mine *.java
    /*
      Base.java --
     */
    package mine;
    public abstract class Base {
        protected abstract void draw();
    }
    /*
      HelloWorld.java --
     */
    package mine;
    public class HelloWorld {
        public static void main(String[] args){
            Derived d=new Derived();
            d.draw();
        }
    }
    /*
      Derived.java --
     */
    package mine;
    public class Derived extends Base {
        @Override
        public void draw(){
            System.out.println("In Derived::draw()-3");
        }
    }
    

Constructors

  • Default constructor:
    In both C++ and Java, a default constructor (i.e., a constructor taking no arguments) is generated only if we don’t declare any constructor ourselves. One consequence in Java: if the superclass has no default constructor and a subclass’s constructor doesn’t call super(...), there is a compilation error. super(...) must be the first statement in a subclass’s constructor.
  • Copy constructor:

C++ provides a copy constructor if we don’t declare one, while Java doesn’t. The following Java code doesn’t compile.

// filename: Main.java

class Complex {

    private double re, im;

    public Complex(double re, double im) {
        this.re = re;
        this.im = im;
    }
}

public class Main {

    public static void main(String[] args) {
        Complex c1 = new Complex(10, 15);  
        Complex c2 = new Complex(c1);  // compiler error here
    }
}
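The default-constructor point above can be sketched as follows (Point and LabeledPoint are hypothetical names): once a class declares any constructor, Java generates no no-arg constructor, so the subclass must call super(...) explicitly, as the first statement.

```java
// If a class defines any constructor, Java does NOT generate a no-arg one,
// so subclass constructors must call super(...) explicitly (and first).
class Point {
    final int x, y;
    Point(int x, int y) { this.x = x; this.y = y; }   // no Point() exists now
}

class LabeledPoint extends Point {
    final String label;
    LabeledPoint(int x, int y, String label) {
        super(x, y);          // required: omitting this is a compile error
        this.label = label;
    }
}

public class CtorDemo {
    public static void main(String[] args) {
        LabeledPoint p = new LabeledPoint(1, 2, "origin-ish");
        System.out.println(p.label + " at (" + p.x + "," + p.y + ")");
    }
}
```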

Data Hiding and the keyword final

/*
  KeywordFinal.java --
  The following example shows the problem that a data member is hidden by the subclass.
  In Java, methods are virtual, but data members are not.
 */
class Base{
    public int a = 10; 
}
class Derived extends Base{
    public int a = 20; 
}
public class DataHiding{
    public static void print(Base obj){
        System.out.println("obj.a="+ obj.a);
        obj.a = 30; 
    }
    public static void main(String[] args){
        Derived obj = new Derived();
        print(obj); // obj.a = 10; 
        System.out.println("obj.a="+ obj.a);// obj.a = 20;
        System.out.println("obj.a="+ ((Base)obj).a);// obj.a = 30;
        System.out.println("End");
    }
}

C++ behaves exactly the same.

/*
 DataHiding.cpp -- 
*/

#include <iostream>
using namespace std;
class Base{
public:
  int a = 10;
};
class Derived: public Base{
public:
  int a = 20; 
};
void print(Base& obj){
  cout<<"obj.a="<<obj.a<<"\n";
  obj.a = 30; 
}
int main(int argc, char* argv[]){
  Derived* obj = new Derived();
  print(*obj); //obj.a = 10;
  cout<<"obj.a="<<obj->a<<"\n"; //obj.a=20;
  cout<<"obj.a="<<((Base*)obj)->a<<"\n";//obj.a=30;
  delete obj;
  return 0;
}
In Java, the keyword “final” can be used on methods to prevent them from being overridden or hidden by subclasses. C++11 does introduce the contextual keyword final; however, it is only used on virtual functions, to prevent them from being overridden by subclasses, or on a class, to prevent it from being further derived.

“final” variables (Java) vs. const (C++)

Cf. wiki-final variables in Java. Summary:

  • Unlike the value of a constant, the value of a final variable is not necessarily known at compile time. Java’s final variables must be set in initializers or constructors.
  • Java’s final variables do not guarantee immutability.
    public final Position pos;

    pos cannot be re-assigned, but its members can change unless they are final too. In C++, by contrast, a const reference prevents modification through it:

    const Position & pos;
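A minimal sketch of the point above (Position here is a hypothetical class): the final reference pos cannot be re-assigned, yet the object it refers to remains mutable.

```java
// "final" freezes the binding, not the object it points to.
class Position {
    int x, y;                          // not final, so mutable through a final reference
    Position(int x, int y) { this.x = x; this.y = y; }
}

public class FinalDemo {
    public static void main(String[] args) {
        final Position pos = new Position(0, 0);
        pos.x = 5;                     // allowed: only the binding is final
        // pos = new Position(1, 1);   // compile error: cannot assign to a final variable
        System.out.println("pos.x=" + pos.x);
    }
}
```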
    

“Final” class and override

Java has the concept of a final class, which cannot be subclassed. There are also final methods and final variables; final methods cannot be overridden or hidden by subclasses. Java also has the annotation @Override, which lets the compiler detect whether a method actually overrides anything. The keywords final and override were only introduced to C++ in C++11. They are called contextual keywords, which means whether override is a keyword or not depends on the context (for example, it is valid to name a variable override). Also, override is a suffix instead of a prefix. The following example is from cppreference.com:
struct A
{
  virtual void foo();
  void bar();
};

struct B : A
{
  void foo() const override; // Error: B::foo does not override A::foo (signature mismatch)
  void foo() override;       // OK: B::foo overrides A::foo
  void bar() override;       // Error: B::bar is not virtual
};
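For comparison, the Java side of the same ideas can be sketched as follows (all class names hypothetical): final prevents subclassing and overriding, while @Override lets the compiler catch signature mismatches.

```java
class Shape {
    void draw() { System.out.println("Shape.draw"); }
    final void id() { System.out.println("shape"); }  // final method: subclasses cannot override it
}

final class Circle extends Shape {    // final class: "class Oval extends Circle" would not compile
    @Override
    void draw() { System.out.println("Circle.draw"); }
    // @Override void drw() { }       // compile error: annotated method overrides nothing
}

public class FinalOverrideDemo {
    public static void main(String[] args) {
        Shape s = new Circle();
        s.draw();   // dynamic dispatch picks Circle.draw
        s.id();
    }
}
```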

Initializers

  • Java allows one-line initializers or initializer blocks. Initializers are executed whenever an instance of the class is created, regardless of which constructor is used to create the instance. The following is an example (Java Initializer):
  class PrimeClass
  {
      private Scanner sc = new Scanner(System.in);
      public int x;
      {
          System.out.print("Enter the starting value for x: ");
          x = sc.nextInt();
      }
}
  • C++ uses initialization lists, which are part of the constructors.

Abstract Class vs. Interface

Both Java and C++ have the concept of an abstract class, but the concepts are not quite the same. An abstract class cannot be instantiated. In C++, abstract classes are classes with pure virtual functions; C++ allows multiple inheritance and virtual inheritance. Java distinguishes abstract classes from interfaces: a class can extend at most one class (abstract or not) but can implement many interfaces. An interface cannot have instance data members, and all its methods are public.
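A small runnable sketch of the Java side (all names here are hypothetical): a class extends at most one abstract class but can implement several interfaces, and interface methods are public in the implementer.

```java
interface Printable { void print(); }     // interface: no instance fields, methods implicitly public
interface Closable2 { void close(); }     // hypothetical second interface

abstract class Document {
    abstract String title();              // abstract method: no body, so Document can't be instantiated
    void describe() { System.out.println("doc: " + title()); }  // shared concrete implementation
}

// One superclass, multiple interfaces:
class Report extends Document implements Printable, Closable2 {
    String title() { return "report"; }
    public void print() { System.out.println("printing " + title()); }
    public void close() { System.out.println("closed"); }
}

public class AbstractDemo {
    public static void main(String[] args) {
        Report r = new Report();
        r.describe();
        r.print();
        r.close();
    }
}
```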

Method Overriding

  • In Java, all instance methods are virtual by default, which is not the case in C++. In C++, a subclass method can either hide or override a base-class method, which turns out to be a source of errors; C++11 introduces the contextual keyword override to help.

Abstract methods(Java) vs. pure virtual functions(c++)

An abstract method in Java can only be public or protected, not private. In C++, it is perfectly valid to define a private virtual function. This C++ rule seems odd at first but is perfectly sensible (for example, to implement the Template Method pattern, cf. Effective C++ Item 35). The following example compiles and runs. Note that we can even change the access control of a pure virtual function in the subclass.

#include <iostream>
using namespace std;
class Base{
public:
  //The following line is an error: a constructor must not call a pure virtual function (it fails at link or run time)
  //Base(){cout<<"In Base, calling Base::print()\n"; print();}
  void testPrint(){cout<<"In Base, calling Base::print()\n"; print();}
private:
  virtual void print()=0;
};
class Derived:public Base{
public:
  Derived(){print();}
  virtual void print() override {std::cout<<"In Derived::print\n";}// Note the use of keyword override 
};
int main(int argc, char* argv[]){
  Derived d;
  d.print();
  d.testPrint();
  return 0;
}

Destructors (C++) vs. Finalizers (Java)

  • In C++, destructors are automatically called in the reverse order of construction. In Java, finalize() methods are not automatically chained; a subclass needs to call super.finalize() explicitly.
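A minimal sketch of the chaining point (note that finalize() has been deprecated since Java 9; here it is called directly just to make the chaining visible, since GC timing is unpredictable):

```java
class Base {
    @Override
    protected void finalize() throws Throwable {
        System.out.println("Base.finalize");
        super.finalize();
    }
}

class Derived extends Base {
    @Override
    protected void finalize() throws Throwable {
        System.out.println("Derived.finalize");
        super.finalize();   // NOT automatic: omit this line and Base.finalize never runs
    }
}

public class FinalizeDemo {
    public static void main(String[] args) throws Throwable {
        new Derived().finalize();  // invoked directly for demonstration only
    }
}
```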

Concurrency

Templates (C++) vs. Parameterized Types (Java)

  • Primitive types can be passed as type parameters of a template in C++, but primitives cannot be used as type parameters in Java. Java uses boxed primitives to resolve the problem.
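The boxing point can be sketched as follows: List&lt;int&gt; is illegal in Java, so the wrapper class Integer is used as the type argument, with autoboxing hiding most of the conversions.

```java
import java.util.ArrayList;
import java.util.List;

public class BoxingDemo {
    public static void main(String[] args) {
        // List<int> xs = ...;               // compile error: primitives can't be type arguments
        List<Integer> xs = new ArrayList<>(); // boxed primitive as the type parameter
        xs.add(42);                           // autoboxing: int -> Integer
        int first = xs.get(0);                // auto-unboxing: Integer -> int
        System.out.println("first=" + first);
    }
}
```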

Subclass Covariance

  • In C++, passing by value and passing by reference behave differently.

When passing by value, a single-argument constructor will be called for the implicit conversion (cf. typecasting in C++). As a consequence, polymorphism is lost.

#include <iostream>
using namespace std;
class Derived;
class Base{
public:
  virtual void print(){cout<<"In Base\n";}
  Base(){cout<<"In default Base ctor\n";}
  Base(const Derived& derived) {cout<<"In non-default ctor\n";} 
};
class Derived:public Base{
public:
  virtual void print()override{cout<<"In Derived\n";}
};
void print(Base b){
  b.print();
}
int main(int argc, char* argv[]){
  Derived d;
  print(d); 
  return 0;
}

//output: In default Base ctor
//        In non-default ctor
//        In Base

  • Java uses the concept of covariance: no constructor is called, and polymorphism is preserved. It is just like passing by reference in C++.
   class Base{
    public Base(Derived d) {System.out.println("In Base non-default ctor");}
    public Base(){System.out.println("In Base default ctor");}
    public void print(){System.out.println("In Base print ctor");}  
}
class Derived extends Base{
    public void print(){System.out.println("In Derived");}
}
public class Covariant{
    public static void print(Base obj){
        obj.print();
    }
    public static void main(String[] args){
        Derived d = new Derived();
        print(d);
    }
}
//output: In Base default ctor
//output: In Derived

Assignment =

  • In Java, assignment means different things for primitive types and reference types. For a primitive type, “=” means copy; for a reference type, “=” re-binds the reference (both variables then refer to the same object).
/*
  assignment3.java --
  assignment means different things for primitive types and reference types.
  For primitive types, "=" means copy; for reference types, "=" re-binds the reference.
   */

//For primitive type, "=" means copy 
public class assignment3 {
    public static void main(String[] args){
        int a = 10;
        int b = a;
        System.out.println("a=" + a); //a=10;
        System.out.println("b=" + b); //b=10;
        a=20;
        System.out.println("a=" + a); // a=20;
        System.out.println("b=" + b); // b=10;
        System.out.println("End");
    }
}

//For reference type, "=" means "bound"
class A{
    public int a =10;
}
public class assignment2 {
    public static void main(String[] args){
        A a = new A();
        A b = a;
        System.out.println("a.a:" + a.a);
        System.out.println("b.a:" + b.a);
        a.a=20;
        System.out.println("a.a:" + a.a);
        System.out.println("b.a:" + b.a);
        System.out.println("End");
    }
}

Prompt String and Color Setup



In Bash, PS1 sets the prompt string. There are four kinds of prompts in Bash: PS1, PS2, PS3 and PS4. “\” denotes the escape character. In the following example:

#!/usr/bin/bash
PS1="\[\033[1;34m\][\u]\[\033[1;32m\][\w]\[\033[0m\]> "
  • \[ : begins a sequence of non-printing characters, which can be used to embed a terminal control sequence into the prompt
  • \] : ends a sequence of non-printing characters
  • A sequence of the form \033[n1;n2m is an ANSI escape code for colors; \033 is ESC, and ESC[ is the CSI (Control Sequence Introducer)
    • \033[1;34m sets the foreground color (30+i) to bold (1) blue (i=4); \033[1;32m sets the foreground color to bold (1) green (i=2); \033[0m resets all attributes
    • [\u] is $USER enclosed in brackets; \w is the working directory.

Zsh sets colors differently than Bash. ANSI escape codes do not seem to be recognized; in Zsh, set the variable PROMPT instead (see the script below). I prefer to use GNU ls and the dircolors utility to set colors. On MacPorts, GNU ls and dircolors are renamed gls and gdircolors. The following is my current setup in zsh. The script uses gls and gdircolors if they are installed; otherwise it falls back to Mac OS’s native LSCOLORS.

#setup colors
autoload -U colors && colors

#setup the prompt string with color
PROMPT="%{$fg_bold[green]%}[%d]%{$reset_color%}>"

#LS_COLORS and dircolors is part of GNU ls
#(which is part of coreutils package); 
#Macports rename them to gdircolors and gls, respectively.
#There is a nice tool to generate ls colors 
#for both BSD ls and GNU ls : http://geoff.greer.fm/lscolors/
#For Mac LSCOLORS syntax, Cf. 
#http://superuser.com/questions/324207/how-do-i-get-context-coloring-in-mac-os-x-terminal
# Ref. http://unix.stackexchange.com/questions/2897/clicolor-and-ls-colors-in-bash
#for GNU LS_COLORS: http://blog.twistedcode.org/2008/04/lscolors-explained.html
if whence gdircolors>/dev/null && whence gls>/dev/null && test -e ~/.dir_colors; then
    eval `gdircolors ~/.dir_colors`
    alias ls='gls --color'
    zstyle ':completion:*:default' list-colors ${(s.:.)LS_COLORS} #for autocomplete; see `man zshcompsys`
else
    export CLICOLOR=1
    export LSCOLORS=exfxcxdxbxegedabagacad
fi
