In Apache Hive, xpath refers to a family of built-in User-Defined Functions (UDFs) used to parse and extract data from XML strings.
Since Hive is designed to handle structured data, and XML is semi-structured, these functions allow you to query XML content stored in a table column as if it were a standard field.
How it Works
The xpath functions use the standard XPath language syntax to navigate through elements and attributes in an XML document.
Common XPath Functions in Hive
Hive provides several variations depending on the data type you need to extract:
| Function | Description |
| xpath(xml, path) | Returns a list (array) of strings that match the XPath expression. |
| xpath_string(xml, path) | Returns the text content of the first matching node. |
| xpath_int(xml, path) | Returns the integer value of the first matching node. |
| xpath_double(xml, path) | Returns the floating-point value of the first matching node. |
| xpath_short / xpath_boolean | Returns the match cast to the specific numeric or boolean type. |
Practical Example
Imagine you have a table named web_logs with a column called xml_data containing the following string:
<bookstore>
<book category="cooking">
<title>Everyday Italian</title>
<price>30.00</price>
</book>
</bookstore>
1. Extracting a single value:
To get the title of the book, you would use:
SELECT xpath_string(xml_data, '/bookstore/book/title') FROM web_logs;
-- Result: "Everyday Italian"
2. Extracting an attribute:
To get the category attribute:
SELECT xpath_string(xml_data, '/bookstore/book/@category') FROM web_logs;
-- Result: "cooking"
3. Extracting multiple values into an array:
If there were multiple books and you wanted all prices:
SELECT xpath(xml_data, '/bookstore/book/price/text()') FROM web_logs;
-- Result: ["30.00", "25.00", ...]
Performance Tip
Parsing XML is computationally expensive compared to reading Parquet or ORC formats. If you find yourself using xpath functions frequently on large datasets, it is often more efficient to:
-
Parse the XML once during the ETL process.
-
Store the extracted values in separate, native Hive columns.

浙公网安备 33010602011771号