Text Processing Formulas
This article introduces you to the most commonly used formulas that can solve a wide variety of problems you may encounter when working with data, particularly in SQL and related environments. Whether you're working with text processing, numerical calculations, or date manipulation, these formulas will help you streamline your work and improve your efficiency. This guide is aimed at users who are familiar with basic SQL functions and are looking to optimize their queries.
Prerequisites
- Basic knowledge of SQL queries and functions is recommended to fully understand and apply these formulas.
Table of Contents
- CONCAT
- SPLIT
- ARRAY_TO_STRING or IFNULL
- REPLACE
- MULTIPLE_REGEXP_REPLACE
- LIKE and REGEXP_CONTAINS
- INITCAP
- LENGHT
- Quick Answers to Common Issues
- Related Articles
CONCAT
This function addresses the common need to combine data from different sources into a single, unified field. It concatenates one or more values into a single result, making it ideal for tasks like combining first and last names or merging address components.
How to use it: CONCAT(value1, value2 [, value3, etc.])
Issue: You have separate columns for first and last name and need to display the full name in a single field. Or, you need to combine address components (street, city, state) into a single address field for reporting or analysis.
Example: CONCAT("John", " ", "Doe") -- Result: "John Doe"
SPLIT
The SPLIT
function can be used to extract different categories or subcategories from a string when your data is organized hierarchically with delimiters, such as categories separated by " > " in product descriptions.
How to use it: SPLIT(column, delimiter)[SAFE_OFFSET(position)]
Tips: The offset numbering starts at zero. To extract the first element, use
SAFE_OFFSET(0)
, and so on.
Issue: Your product descriptions are organized into hierarchical categories separated by " > ", and you need to extract the most specific subcategory for targeted marketing efforts.
Example: SPLIT("Electronics > Audio > Headphones > Wireless", " > ")[SAFE_OFFSET(3)] -- Result: " Wireless"
ARRAY_TO_STRING or IFNULL
Although both ARRAY_TO_STRING
and IFNULL
deal with data manipulation, they serve distinct purposes and are used in different scenarios.
- ARRAY_TO_STRING: This function handles arrays, converting them into a readable string format.
- IFNULL: This function deals with individual values, specifically handling nulls by replacing them with a default value.
ARRAY_TO_STRING
How to use it: ARRAY_TO_STRING(array, delimiter)
Issue: You have an array of product categories that you want to display in a human-readable format.
Solution: Use ARRAY_TO_STRING to concatenate the array elements into a single string with a specified delimiter.ARRAY_TO_STRING(['Electronics', 'Clothing', 'Home Goods'], ', ') -- Result:"Electronics, Clothing, Home Goods"
IFNULL
How to use it: IFNULL(expression, null_replacement)
Issue: You are concatenating product names and categories, but some products have null values for their names or categories.
Solution: Use IFNULL to provide a default value for null entries before concatenation.CONCAT(IFNULL(Product_Name, "Unknown"), " - ", IFNULL(Category, "Uncategorized"))
Result: "Laptop - Electronics" or "Unknown - Office Supplies"
REPLACE
This function tackles the challenge of maintaining accuracy and consistency in textual data. It replaces a specific part of a text string with another value, enabling you to update product descriptions, correct typos, or standardize information across your dataset.
How to use it: REPLACE(column, old_value, new_value)
Issue: You need to update product descriptions with new information, correct spelling errors in customer reviews, or standardize inconsistent data entries, such as replacing abbreviations with full words.
Example: REPLACE(description, "old_product", "new_product")
MULTIPLE_REGEXP_REPLACE
This function performs multiple string replacements simultaneously using regular expressions. It simplifies cleaning and standardizing data by applying several find-and-replace operations in one step.
How to use it: MULTIPLE_REGEXP_REPLACE(column_to_modify, [('old_expression_1', 'new_expression_1'), ('old_expression_2', 'new_expression_2'), ...])
Issue: You have inconsistent data requiring several simultaneous replacements, such as standardizing abbreviations or correcting multiple common misspellings in a single column.
Example: MULTIPLE_REGEXP_REPLACE(text_field, [('dr.', 'doctor'), ('st.', 'street'), ('mrs.', 'madam')])
LIKE and REGEXP_CONTAINS
Both LIKE and REGEXP_CONTAINS are used for string matching, but they differ in their capabilities and complexity.
- LIKE: This operator provides basic pattern matching using wildcards '%' (matches any sequence of zero or more characters) and '_' (matches any single character).
- REGEXP_CONTAINS: This function supports more advanced pattern matching using regular expressions, allowing for complex pattern definitions and greater flexibility.
LIKE
How to use it: column LIKE 'pattern'
Issue: You need to find all products with names containing "shirt".
Solution: Use LIKE with the '%' wildcard to match any characters before and after "shirt"
Example: product_name LIKE '%shirt%'
REGEXP_CONTAINS
How to use it: REGEXP_CONTAINS(column, regexp)
Issue: You need to find all products with IDs starting with "A" followed by three digits.
Solution: Use REGEXP_CONTAINS with a regular expression to define the pattern.REGEXP_CONTAINS(product_id, '^A{3}$')
Tips: To learn more about REGEX regular expressions and their usage in BigQuery, refer to our dedicated article
INITCAP
This function addresses the need for consistent and visually appealing text formatting. It capitalizes the first letter of each word in a string, ensuring that names, titles, or any textual data adhere to a standardized presentation style.
How to use it: INITCAP(value)
Issue: You need to format names consistently in a report, ensure that product titles are displayed with proper capitalization, or standardize the presentation of textual data for improved readability.
Example: INITCAP("hello world")
Result: "Hello World"
LENGTH
This function counts the number of characters in a given string. It's useful for performing conditional checks based on the length of text fields.
How to use it: LENGTH(column)
Issue: You need to ensure a text field meets a minimum character length requirement, such as verifying that product descriptions are sufficiently detailed for display.
Example: LENGTH(product_description)
IF(LENGTH(product_description) < 150, product_title, product_description)
Quick Answers to Common Issues
My CONCAT
isn't adding a space between names sometimes. Why?
Ensure you've explicitly included a space as a separate value within your CONCAT
function: CONCAT(first_name, ' ', last_name)
. Also, check your source data for any unwanted leading or trailing spaces.
My SPLIT
function isn't returning the correct part of the string. What's likely wrong?
Double-check that the delimiter in your SPLIT
function exactly matches the separator in your data, including any spaces. Also, verify that the SAFE_OFFSET
value you're using corresponds to an actual element in the split string.
Why isn't my LENGTH
function giving me the expected character count?
Be mindful of any leading or trailing spaces in your data, as these will be counted as characters. Also, ensure you are applying the LENGTH
function to the correct column.
Related Articles
For further assistance with SmartFeeds, consider reviewing these articles:
For additional help, feel free to reach out via our Contact Us page.