Add an HDFS connection

You can add a connection to the HDFS file system using ThoughtSpot DataFlow.

Follow these steps:

  1. Click Connections in the top navigation bar.

  2. In the Connections interface, click Add connection in the top right corner.

  3. In the Create Connection interface, select the Connection type.

  4. After you select the HDFS Connection type, the rest of the connection properties appear.

    Depending on your choice of authentication mechanism, you may use different properties.

    See Connection properties for details, defaults, and examples.

    Connection name

    Name your connection.

    Connection type

    Choose the HDFS connection type.

    User

    Specify the user that connects to the HDFS file system.

    This user must have data access privileges.

    For Hive security with simple, LDAP, and SSL authentication only.

    Hadoop distribution

    Specify the Hadoop distribution you are connecting to.

    Mandatory field.

    Distribution version

    Specify the version of the distribution selected above.

    Mandatory field.

    Hadoop conf path

    By default, the system picks up the Hadoop configuration files from HDFS. To override this, specify an alternate location. Applies only when the configuration settings differ from the global Hadoop instance settings.

    DFS HA configured

    Enables High Availability for HDFS.

    Optional field.

    DFS name service

    The logical name given to the HDFS nameservice.

    Mandatory field. For HDFS HA only.

    DFS name node IDs

    Specify the comma-separated list of NameNode IDs. DataNodes use this property to determine all the NameNodes in the cluster. The XML property name is dfs.ha.namenodes.<dfs.nameservices>, where <dfs.nameservices> is the DFS name service value. For HDFS HA only.

    RPC address for namenode1

    Specify the fully-qualified RPC address of the first listed NameNode, defined as dfs.namenode.rpc-address.<dfs.nameservices>.<name_node_ID_1>. For HDFS HA only.

    RPC address for namenode2

    Specify the fully-qualified RPC address of the second listed NameNode, defined as dfs.namenode.rpc-address.<dfs.nameservices>.<name_node_ID_2>. For HDFS HA only.

    DFS host

    Specify the DFS hostname or the IP address.

    Mandatory field. Applies only when HDFS HA is not used.

    DFS port

    Specify the associated DFS port.

    Mandatory field. Applies only when HDFS HA is not used.

    Default HDFS location

    Specify the default source/target location.

    Mandatory field.

    Temp HDFS location

    Specify the location for creating the temp directory.

    Mandatory field.

    DFS security authentication

    Select the type of security being enabled.

    Mandatory field.

    Hadoop RPC protection

    Hadoop cluster administrators control the quality of protection using the configuration parameter hadoop.rpc.protection.

    Mandatory field. For DFS security authentication with Kerberos only.

    Hive principal

    Specify the principal for authenticating Hive services.

    Mandatory field.

    User principal

    Specify the user principal associated with the keytab file (configured while enabling Kerberos). Keytab authentication requires both this principal and a keytab file generated by the Kerberos administrator.

    Mandatory field.

    User keytab

    Specify the keytab file generated by the Kerberos administrator. Keytab authentication requires both this file and the user principal associated with it (configured while enabling Kerberos).

    Mandatory field.

    KDC host

    Specify the KDC host name. The KDC (Kerberos Key Distribution Center) is a service that runs on a domain controller server role (configured in the Kerberos configuration file, /etc/krb5.conf).

    Mandatory field.

    Default realm

    Specify the default Kerberos realm. A Kerberos realm is the domain over which a Kerberos authentication server has the authority to authenticate a user, host, or service (configured in the Kerberos configuration file, /etc/krb5.conf).

    Mandatory field.

  5. Click Create connection.
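
The HDFS HA properties in step 4 refer to Hadoop XML property names that are built from the name service and the NameNode IDs. The following is a minimal Python sketch of how those names fit together, which may help when cross-checking the values you enter against the cluster's hdfs-site.xml. The function name ha_properties, the name service "mycluster", the NameNode IDs, and the host names are all hypothetical values chosen for illustration; they are not part of DataFlow.

```python
def ha_properties(nameservice, rpc_addresses):
    """Build the HDFS HA property names for one name service.

    nameservice   -- value entered as "DFS name service" (hypothetical here)
    rpc_addresses -- dict mapping each NameNode ID to its host:port RPC address
    """
    props = {
        # Logical name of the HDFS nameservice.
        "dfs.nameservices": nameservice,
        # Comma-separated list of NameNode IDs ("DFS name node IDs").
        f"dfs.ha.namenodes.{nameservice}": ",".join(rpc_addresses),
    }
    # One RPC address property per listed NameNode.
    for namenode_id, address in rpc_addresses.items():
        props[f"dfs.namenode.rpc-address.{nameservice}.{namenode_id}"] = address
    return props


# Example values are assumptions, not defaults.
props = ha_properties(
    "mycluster",
    {"nn1": "host1.example.com:8020", "nn2": "host2.example.com:8020"},
)
for name, value in props.items():
    print(f"{name} = {value}")
```

In this sketch, the two RPC address fields in the connection form correspond to the two dfs.namenode.rpc-address entries that the loop produces.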
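
The KDC host and Default realm properties in step 4 come from the Kerberos configuration file, /etc/krb5.conf. As a rough illustration, the Python sketch below pulls the default_realm and kdc entries out of krb5.conf-style text so you can copy them into the connection form. The helper name read_krb5_settings and the sample realm and host are hypothetical, and the parsing is deliberately simple: it assumes one setting per line and collects kdc entries from every realm, not only the default one.

```python
import re


def read_krb5_settings(text):
    """Return (default_realm, kdc_hosts) found in krb5.conf-style text."""
    # default_realm normally appears in the [libdefaults] section.
    realm_match = re.search(r"^\s*default_realm\s*=\s*(\S+)", text, re.MULTILINE)
    default_realm = realm_match.group(1) if realm_match else None
    # kdc entries appear inside realm blocks in the [realms] section.
    kdc_hosts = re.findall(r"^\s*kdc\s*=\s*(\S+)", text, re.MULTILINE)
    return default_realm, kdc_hosts


# Sample content is an assumption for illustration only.
SAMPLE = """
[libdefaults]
    default_realm = EXAMPLE.COM

[realms]
    EXAMPLE.COM = {
        kdc = kdc.example.com
        admin_server = kdc.example.com
    }
"""

default_realm, kdc_hosts = read_krb5_settings(SAMPLE)
print("Default realm:", default_realm)
print("KDC host(s):", ", ".join(kdc_hosts))
```

To run the same check against a real file, you could pass the contents of /etc/krb5.conf to read_krb5_settings instead of SAMPLE.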