Things on this page are fragmentary and immature notes/thoughts of the author. Please read with your own judgement!
Tips and Traps¶
Never use the -skipTrash option unless you are fully aware of what you are doing. I have accidentally removed HDFS paths with the -skipTrash option a couple of times, and those HDFS files could not be recovered from the trash.
The HDFS command supports wildcards. However, * matches all files/directories including hidden ones (which is different from Linux/Unix shells).
The success file _SUCCESS is generated when a Spark/Hadoop application succeeds. It can be used to check whether the data produced by a Spark application is ready. The success file _HIVESUCCESS is generated when a Hive table is refreshed successfully. It can be used to check whether a Hive table is ready for consumption.
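A minimal sketch of the readiness check described above, assuming the hdfs client is on the PATH. The function name is_ready and the path in the usage comment are hypothetical; the only Hadoop feature relied on is hdfs dfs -test -e, which exits with status 0 when the path exists.

```bash
#!/usr/bin/env bash
# is_ready is a hypothetical helper name, not part of Hadoop.
# It succeeds (exit status 0) when <dir>/_SUCCESS exists on HDFS.
is_ready() {
    local dir="${1%/}"   # strip one trailing slash, if any
    hdfs dfs -test -e "${dir}/_SUCCESS"
}

# Usage (the output directory is an assumption for illustration):
#   if is_ready /hdfs/output/dir; then echo "data is ready"; fi
```

The same check should work for a Hive table by testing for _HIVESUCCESS instead, provided your cluster writes that marker.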
hadoop fs vs hadoop dfs vs hdfs dfs¶
hadoop fs supports generic file systems. It can be used when you are dealing with different file systems such as Local FS, HFTP FS, S3 FS, and others. hadoop fs is recommended when you work with different file systems at the same time.
Both hdfs dfs and hadoop dfs are specific to HDFS and only work for HDFS operations. hadoop dfs has been deprecated in favor of hdfs dfs. hdfs dfs is recommended when you work with HDFS only.
cat - Print a File¶
:::bash
hdfs dfs -cat
chmod - Change Permission of Files/Directories¶
:::bash
hdfs dfs -chmod -R 777 /hdfs/path
chown - Change the Owner of Files/Directories¶
:::bash
hdfs dfs -chown new_owner /hdfs/path
Note that an HDFS path (file or directory) can only be removed by its owner.
Other users cannot remove the path even if the permission of the path is set to 777.
If an HDFS path under user A’s home (/user/A/) is changed to be owned by user B,
then neither A nor B can remove the path.
You have to change the owner of the path back to user A
and then remove the path as user A.
count (for Quota)¶
:::bash
hdfs dfs -count -q -v -h /user/username
QUOTA REM_QUOTA SPACE_QUOTA REM_SPACE_QUOTA DIR_COUNT FILE_COUNT CONTENT_SIZE PATHNAME
16 K 7.9 K 3 T 3.0 T 27 8.1 K 573.8 M /user/username
QUOTA is the namespace quota,
i.e., the number of files and directories you can store, and REM_QUOTA is how much of it is left.
The directory /tmp has no quota limit. You can use it for storing files temporarily.
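The relationship between the first two columns can be sketched as follows. Without the -h flag the command prints plain integers, so the number of names already used is QUOTA minus REM_QUOTA. The function name used_names is hypothetical, and the sketch assumes a namespace quota is actually set (otherwise the first two columns read none and inf, and the line is skipped).

```bash
#!/usr/bin/env bash
# used_names is a hypothetical helper name.
# It reads the output of `hdfs dfs -count -q <path>` (no -h, so the columns
# are plain integers) on stdin and prints QUOTA - REM_QUOTA for each path.
# Header lines and paths without a namespace quota are skipped.
used_names() {
    awk '$1 ~ /^[0-9]+$/ && $2 ~ /^[0-9]+$/ { print $NF ": " ($1 - $2) " names used" }'
}

# Usage (assumes the hdfs client is on the PATH):
#   hdfs dfs -count -q /user/username | used_names
```

The result should match DIR_COUNT plus FILE_COUNT, which is consistent with the sample output above (27 directories plus about 8.1 K files against 16 K minus 7.9 K).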
cp - Copy Files/Directories¶
:::bash
hdfs dfs -cp /user/saurzcode/dir1/abc.txt /user/saurzcode/dir2
get - Download a File/Directory¶
:::bash
hdfs dfs -get
getmerge¶
:::bash
hdfs dfs -getmerge /hdfs/path /path/in/linux
mkdir¶
:::bash
hdfs dfs -mkdir [-p] /path/to/create
mv - Move/Rename Files/Directories¶
Move/rename the source file/directory /path/to/source TO /path/to/des.
:::bash
hdfs dfs -mv /path/to/source /path/to/des
Move the source file/directory /path/to/source INTO the directory /path/to/des.
That is,
move/rename the file/directory /path/to/source to /path/to/des/source.
:::bash
hdfs dfs -mv /path/to/source /path/to/des/
put/copyFromLocal - Upload a File/Directory to HDFS¶
:::bash
hdfs dfs -put [-f]
The option -f overwrites existing files on HDFS.
However,
a subtle misunderstanding can arise when you upload a directory using the following command.
:::bash
hdfs dfs -put -f /local/path/to/some_directory /hdfs/path/to/some_directory
Suppose /hdfs/path/to/some_directory already exists.
It is not the directory /hdfs/path/to/some_directory itself that gets overwritten
but rather the files in it.
If files in /local/path/to/some_directory have different names than files in /hdfs/path/to/some_directory
then nothing is overwritten and the old files stay in place.
This might not be what you want and can bite you.
It is suggested that you always remove the directory first using the command hdfs dfs -rm -r /hdfs/path/to/some_directory
if you intend to overwrite the whole directory.
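The remove-then-upload advice above can be sketched as a small wrapper. The function name replace_hdfs_dir is hypothetical and the sketch assumes the hdfs client is on the PATH; it only uses the -rm -r -f and -put subcommands.

```bash
#!/usr/bin/env bash
# replace_hdfs_dir is a hypothetical helper name.
# It removes the target directory on HDFS (if present) and then uploads the
# local directory, so no stale files from the old directory survive.
replace_hdfs_dir() {
    local src="$1" dst="$2"
    # -f: do not fail when the target does not exist yet.
    # -skipTrash is deliberately not used, so the old directory can still be
    # recovered from the trash when trash is enabled on the cluster.
    hdfs dfs -rm -r -f "$dst" || return 1
    hdfs dfs -put "$src" "$dst"
}

# Usage:
#   replace_hdfs_dir /local/path/to/some_directory /hdfs/path/to/some_directory
```

Stopping when the removal fails is a deliberate choice here: uploading into a directory that could not be removed would reproduce the mixed old/new state the wrapper is meant to avoid.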
tail - Show Last Lines of a File¶
:::bash
hdfs dfs -tail /user/saurzcode/dir1/abc.txt
The command hdfs dfs -mkdir supports the -p option similar to that of the mkdir command in Linux/Unix.
Check the size of a directory. However, the depth option is not supported currently.
:::bash
hdfs dfs -du [-s] [-h] URI [URI …]
Remove a directory in HDFS without making a backup in trash. This is a dangerous operation but it is useful when the directory that you want to remove is too big to place into the trash directory.
:::bash
hdfs dfs -rm -r -skipTrash /tmp/item_desc
checksum¶
:::bash
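hdfs dfs -checksum URL
Notice that the checksum command on HDFS returns a different result from the md5sum command on Linux, because HDFS computes a composite checksum over block-level CRCs rather than an MD5 of the file's bytes.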
setfacl¶
Grant permission to a user.
:::bash
hadoop fs -setfacl -R -m user:user_name:rwx /path/to/grant/permission
Merge Multiple Files¶
Use a hadoop-streaming job (with a single reducer) to merge the data of all part files into a single HDFS file on the cluster itself, and then use hdfs dfs -get to fetch the single file to the local system.
:::bash
hadoop jar /usr/hdp/2.3.2.0-2950/hadoop-mapreduce/hadoop-streaming-2.7.1.2.3.2.0-2950.jar \
-Dmapred.reduce.tasks=1 \
-input "/hdfs/input/dir" \
-output "/hdfs/output/dir" \
-mapper cat \
-reducer cat
Hadoop FS compress¶
http://
Files cannot be updated in place; you have to update a file locally and upload it to Hadoop again.
By default there are 3 replicas of each block of data; 3 replicas is the best choice according to many discussions.
Hadoop is meant for large data, of course.
Because of fault tolerance (data replication), you actually use more space on Hadoop.
Master node (name node): stores data about data (metadata). There are primary and secondary master nodes for reliability.
Data nodes (slave nodes) store the data. An edge node is the access point for the external applications, tools, and users that need to utilize the Hadoop environment.
In short: edge nodes (gateway to Hadoop), name nodes, data nodes.
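The extra space used because of replication, noted above, can be estimated with simple arithmetic. This is a rough sketch: raw_usage_gb is a hypothetical helper name, sizes are whole gigabytes, and block-size rounding and metadata overhead are ignored.

```bash
#!/usr/bin/env bash
# raw_usage_gb is a hypothetical helper name.
# Prints the raw disk space (in GB) consumed by a data set of the given
# logical size; the replication factor defaults to 3 as stated above.
raw_usage_gb() {
    local logical_gb="$1" replication="${2:-3}"
    echo $(( logical_gb * replication ))
}

raw_usage_gb 100      # prints 300
raw_usage_gb 100 2    # prints 200
```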