Useful Linux Commands

User and Group management

Add an existing user to a supplementary group

$ sudo usermod -aG <groupname> <username>

Note: -G alone replaces the user's supplementary group list; adding -a appends to it instead.

Check members of a group

$ grep -i <group name> /etc/group

Check groups a user belongs to

$ groups <username>

Package management

EPEL base repo: http://fedoraproject.org/wiki/EPEL

Update existing packages

$ sudo yum update

Search yum repos

$ sudo yum search <package name>

Install

$ sudo yum install <package name> -y

Download a package without installing it

$ sudo yum install yum-downloadonly
$ sudo yum install --downloadonly --downloaddir=<directory> <package>

Install from local rpm

$ sudo yum localinstall <path to .rpm file> -y

Remove

$ sudo yum remove <package name>

View installed packages

$ sudo yum list installed

View configured repos list

$ sudo yum repolist -v
$ sudo ls -l /etc/yum.repos.d/

List the files installed by a package

$ rpm -ql <package name>

View details of an installed package

$ rpm -qi <package name>

Add a new repo

- Download .repo file in /etc/yum.repos.d/

- Optionally, add the GPG key used to sign the packages with the rpm --import command, so yum can verify the downloaded packages.

Example: http://www.cloudera.com/documentation/enterprise/latest/topics/cdh_ig_cdh5_install.html#topic_4_4_2
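
As a sketch, a .repo file looks like the following (repo name and URLs here are hypothetical):

```
[examplerepo]
name=Example Repository
baseurl=http://example.com/repo/el7/$basearch/
enabled=1
gpgcheck=1
gpgkey=http://example.com/RPM-GPG-KEY-example
```

The key referenced by gpgkey can be imported with: sudo rpm --import http://example.com/RPM-GPG-KEY-example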

Tar and untar

Tar a directory and compress the tar file with gzip

$ tar -zcf anaconda3.tar.gz anaconda3

Untar

$ tar xf anaconda3.tar.gz

Explore Processes

View Running jvm processes

$ sudo jps -lv

View java thread dump

$ jstack <pid>

Find whether a given process is running

$ sudo ps -ef | grep -i <process name e.g. java>

Tools to automatically restart process daemons: daemontools, supervisor
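
For example, a minimal supervisor program section might look like this (program name and paths are hypothetical):

```
[program:myservice]
; hypothetical daemon managed by supervisord
command=/usr/local/bin/myservice
autostart=true
autorestart=true
stderr_logfile=/var/log/myservice.err.log
```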

Resource Utilization

View disk IO per process

$ sudo yum install iotop -y
$ sudo iotop -o

Disk IO at the device level

$ sudo yum install sysstat -y
$ sudo iostat -dx 5

# Options #

d : Display the Disk utilization

x : Display extended statistics

5 : Interval in seconds

View mem and CPU utilization

$ top

View memory information

$ cat /proc/meminfo

View network utilization

$ sudo yum install iftop -y
$ sudo iftop -n

There are several other commands. See the list below.

iftop -- shows current open connections and their transfer rates

vmstat -- shows virtual memory status (found in procps package)

iostat -- shows current IO transfer rate by devices (found in sysstat package)

dstat -- combines vmstat-, iostat- and ifstat-like info, including network IO

iotop -- like top, but the focus is on IO transfer rate

lsof -- get info from currently open files

fuser -- does a subset of lsof: identify processes and users using certain files

Upgrade gcc to 4.9

$ sudo yum install centos-release-scl
$ sudo yum install devtoolset-3-gcc devtoolset-3-gcc-c++
$ scl enable devtoolset-3 bash

(devtoolset-3 provides gcc 4.9; devtoolset-6 provides gcc 6.)

Verify gcc version

$ gcc --version

gcc (GCC) 4.9.2 20150212 (Red Hat 4.9.2-6)
Copyright (C) 2014 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

Services

Find the exact name of an available service

$ sudo ls -l /etc/init.d/<first few letters of name e.g. hadoop>*

Start/Stop/Restart/Find status of a service

$ sudo service <name of a service e.g. hadoop-hdfs-namenode> [start|stop|restart|status]

Check the status of all services matching a pattern

$ for service in /etc/init.d/hadoop*; do sudo $service status; done

System Activity Information

$ sudo sar

Run a command every n seconds

$ watch -n 1 cat /proc/meminfo

Copy files using rsync

$ rsync -avhuz --progress --rsh="ssh -p2222" Downloads training@localhost:~

You can also set environment variable for SSH port.

$ export RSYNC_CONNECT_PROG='ssh -p2222'


To delete files at the target that have been deleted at the source, add the --delete argument.


Pipe Commands

$ aws logs describe-log-groups --output text | cut -f 4 | while read -r line; do aws logs delete-log-group --log-group-name "$line"; done


Copy only .pdf files from a directory tree

$ rsync -avm --include='*.pdf' --include='*/' --exclude='*' --prune-empty-dirs /source /destination

Synchronize two directories every second

$ while sleep 1; do rsync -avuz /Users/user01/workspace/scala/heavy --exclude "target" user01@server01:/home/user01/workspace/scala; done

Clush

Clush is an open-source tool that lets you execute commands in parallel across the nodes of a cluster.

$ sudo -i # Login as root
$ yum install clustershell -y
$ vi /etc/clustershell/groups

Add the cluster nodes

all: server[01-04]

Run commands to all nodes in the cluster

$ clush -a date

Copy /etc/hadoop/conf/core-site.xml to all cluster nodes.

$ clush -a -c /etc/hadoop/conf/core-site.xml

Verify:

$ clush -ab ls -l /etc/hadoop/conf/

Find external IP

$ dig +short myip.opendns.com @resolver1.opendns.com

Text Editing

Compare two files

$ vim -d file1 file2

Find and replace text

:%s/find_string/replace_with string/gc
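
The same substitution can be done non-interactively with sed (a sketch; add -i to edit a file in place instead of reading stdin):

```shell
# replace every occurrence of "find_string" with "replace_with" on each line
printf 'find_string stays find_string\n' | sed 's/find_string/replace_with/g'
# prints: replace_with stays replace_with
```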

Working with directories and files

View directory listing

$ ls 

Change directory

$ cd <dir name, absolute path or relative path>

Know current directory

$ pwd

Create a directory (-p also creates missing parent directories)

$ mkdir -p <directory path>

Move or Rename file/dir

$ mv <source path> <destination path>

Delete a file /dir

$ rm -rf <file/dir path>

Download multiple files using urls in a file

Create a file links.txt and put links in the file - one line per link. Then run the wget command.

$ wget -i links.txt

Download all files mentioned in links on a page

$ wget --no-clobber --convert-links --random-wait -r -p -E -e robots=off -U mozilla http://site/path/

Disable IPv6

In /etc/sysctl.conf: net.ipv6.conf.all.disable_ipv6 = 1

In /etc/sysconfig/network: NETWORKING_IPV6=no

In /etc/sysconfig/network-scripts/ifcfg-eth0: IPV6INIT="no"

Disable ip6tables: chkconfig --level 345 ip6tables off

Then reboot.

Test network speed between 2 machines

Machine 1: Start Netcat to Listen

$ nc -lk 2112 >/dev/null

Machine 2:

$ dd if=/dev/zero bs=16000 count=625 | nc -v <machine 1 ip/name> 2112

Use iperf tool

$ sudo yum install iperf -y

Machine 1:

$ iperf -s

Machine 2:

$ iperf -c <machine 1 address> -d

Complete guide: http://openmaniak.com/iperf.php

Test Internet speed using speedtest-cli

$ wget https://raw.githubusercontent.com/sivel/speedtest-cli/master/speedtest.py
$ python3 speedtest.py
Retrieving speedtest.net configuration...
Testing from ACT (106.51.31.233)...
Retrieving speedtest.net server list...
Selecting best server based on ping...
Hosted by E-Infrastructure & Entertainment India Pvt. Ltd (Bangalore) [4.55 km]: 12.169 ms
Testing download speed................................................................................
Download: 30.84 Mbit/s
Testing upload speed....................................................................................................
Upload: 31.62 Mbit/s

Test Disk-IO using dd (sequential read/write)

$ dd if=/dev/zero of=/tmp/testfile bs=1G count=1 oflag=direct
1+0 records in
1+0 records out
1073741824 bytes (1.1 GB) copied, 6.14984 s, 175 MB/s

Random Read/Write test using fio tool

$ sudo yum install fio -y
$ fio --randrepeat=1 --ioengine=libaio --direct=1 --gtod_reduce=1 --name=test --filename=test --bs=4k --iodepth=64 --size=4G --readwrite=randrw --rwmixread=75

Reference:

  • https://www.binarylane.com.au/support/solutions/articles/1000055889-how-to-benchmark-disk-i-o

Source a file as a stream

Randomly subset lines from a text file and write them to a new file. The subset size is a random number between 0 and 9 ($RANDOM % 10).

$ shuf -n $(($RANDOM % 10)) ~/tweets.small.json > /user/mapr/tweets_raw/$(date +%s).json

Looping in Shell

$ for i in `seq 1 10`; do echo $i; done


Looping with a delay of 1 sec

$ for i in `seq 1 10`; do echo $i; sleep 1 ; done


Print a file line by line with a delay of one second

$ for i in `seq 1 10`; do head -n $i ~/tweets.small.json | tail -n 1; sleep 1; done


Using grep and awk

$ head /data/ml-latest-small/movies.csv 
movieId,title,genres
1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
2,Jumanji (1995),Adventure|Children|Fantasy
3,Grumpier Old Men (1995),Comedy|Romance
4,Waiting to Exhale (1995),Comedy|Drama|Romance
5,Father of the Bride Part II (1995),Comedy
6,Heat (1995),Action|Crime|Thriller
7,Sabrina (1995),Comedy|Romance
8,Tom and Huck (1995),Adventure|Children
9,Sudden Death (1995),Action

Find lines that contain "comedy"

$ cat /data/ml-latest-small/movies.csv  | grep -i "comedy" | head
1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
3,Grumpier Old Men (1995),Comedy|Romance
4,Waiting to Exhale (1995),Comedy|Drama|Romance
5,Father of the Bride Part II (1995),Comedy
7,Sabrina (1995),Comedy|Romance
11,"American President, The (1995)",Comedy|Drama|Romance
12,Dracula: Dead and Loving It (1995),Comedy|Horror
18,Four Rooms (1995),Comedy
19,Ace Ventura: When Nature Calls (1995),Comedy
20,Money Train (1995),Action|Comedy|Crime|Drama|Thriller

Show the movie title in the above output.

$ cat /data/ml-latest-small/movies.csv  | grep -i "comedy" | awk -F "," '{print $2}' | head
Toy Story (1995)
Grumpier Old Men (1995)
Waiting to Exhale (1995)
Father of the Bride Part II (1995)
Sabrina (1995)
"American President
Dracula: Dead and Loving It (1995)
Four Rooms (1995)
Ace Ventura: When Nature Calls (1995)
Money Train (1995)

It looks like some titles have been clipped. That's because those titles contain a comma and are quoted in the CSV, but awk splits on every comma regardless of quoting. Let's check: does every line contain exactly 3 fields (id, title, genre)?

$ cat /data/ml-latest-small/movies.csv  | grep -i "comedy" | awk -F "," 'NF != 3 {print}' | head
11,"American President, The (1995)",Comedy|Drama|Romance
54,"Big Green, The (1995)",Children|Comedy
58,"Postman, The (Postino, Il) (1994)",Comedy|Drama|Romance
119,"Steal Big, Steal Little (1995)",Comedy
141,"Birdcage, The (1996)",Comedy
144,"Brothers McMullen, The (1995)",Comedy
166,"Doom Generation, The (1995)",Comedy|Crime|Drama
203,"To Wong Foo, Thanks for Everything! Julie Newmar (1995)",Comedy
239,"Goofy Movie, A (1995)",Animation|Children|Comedy|Romance
255,"Jerky Boys, The (1995)",Comedy
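
For this particular file, a quote-aware awk one-liner works, assuming titles are quoted only when they contain a comma (a sketch, not a general CSV parser):

```shell
# print the title: split on the double quote when the line is quoted,
# otherwise take the second comma-separated field
printf '%s\n' '1,Toy Story (1995),Adventure' '11,"American President, The (1995)",Comedy' |
  awk -F, '{ if ($0 ~ /"/) { split($0, a, "\""); print a[2] } else print $2 }'
# prints:
# Toy Story (1995)
# American President, The (1995)
```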