Data Processing Algorithm Templates

Default Templates

DataFlow provides a variety of built-in data processing templates, including Basic Data Processing, Advanced Data Processing, and Data Augmentation templates. The platform will continue to expand with new templates to enhance data processing capabilities.

Users can also create custom templates or modify existing ones to build personalized data processing pipelines tailored to specific needs.

[Figure: Default Template]

Creating a New Algorithm Template

  • Modify a Built-in Template: Click Copy on a built-in template card to open the template creation page.

  • Create a Template from Scratch: Click the Custom Template item to navigate to the Custom Template page, then click the + Create button.

[Figure: Create New Template]

Fill in the Template Name, Task Type, and Template Description fields, and select the necessary operators and their execution order.

Note: Some operators require parameter configuration.

Once configured, click Creation Completed to start using the new template for data processing tasks.
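
As a rough mental model, a template can be thought of as a named, ordered list of operators, each with optional parameters. The sketch below is illustrative only: the class names (`Template`, `OperatorStep`), the `task_type` value, and the parameter names (`min_len`, `max_len`) are assumptions made for this example, not DataFlow's actual storage format or API. Only the operator IDs are taken from the operator table in this document.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class OperatorStep:
    """One operator in a template, identified by its operator ID."""
    op_id: str
    # Stays empty for operators that need no configuration.
    params: dict[str, Any] = field(default_factory=dict)


@dataclass
class Template:
    """Hypothetical in-memory view of the template creation form."""
    name: str
    task_type: str
    description: str
    # List order stands in for the execution order chosen in the UI.
    steps: list[OperatorStep] = field(default_factory=list)

    def operator_ids(self) -> list[str]:
        """Return the operator IDs in execution order."""
        return [step.op_id for step in self.steps]


template = Template(
    name="basic-text-cleaning",
    task_type="data_cleaning",  # placeholder; real task types are picked in the UI
    description="Clean HTML and whitespace, then drop very short or very long samples.",
    steps=[
        OperatorStep("clean_html_mapper"),
        OperatorStep("whitespace_normalization_mapper"),
        # Parameter names are assumed for illustration.
        OperatorStep("text_length_filter", {"min_len": 10, "max_len": 10000}),
    ],
)

print(template.operator_ids())
```

Keeping the steps in a list rather than a set or mapping reflects the point made above: the execution order is part of the template, so cleaning mappers can run before the filters that depend on their output.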

[Figure: Template Field Select Ops]

Operators Supported by the Platform

| ID | Name | Type | Description |
| --- | --- | --- | --- |
| chinese_convert_mapper | Chinese Converter | Mapper | Mapper to convert Chinese text between Traditional Chinese, Simplified Chinese, and Japanese Kanji. |
| clean_copyright_mapper | Copyright Cleaner | Mapper | Mapper to clean copyright comments at the beginning of text samples. |
| clean_email_mapper | Email Cleaner | Mapper | Mapper to clean email addresses in text samples. |
| clean_html_mapper | HTML Code Cleaner | Mapper | Mapper to clean HTML code in text samples. |
| clean_ip_mapper | IP Cleaner | Mapper | Mapper to clean IPv4 and IPv6 addresses in text samples. |
| clean_links_mapper | Link Cleaner | Mapper | Mapper to clean links such as http/https/ftp in text samples. |
| expand_macro_mapper | Expand Macro Definitions | Mapper | Mapper to expand macro definitions in the document body of LaTeX samples. |
| generate_code_qa_pair_mapper | Convert code to QA pair | Mapper | Mapper to generate new instruction data based on code. |
| extract_qa_mapper | QA pair extractor | Mapper | Mapper to extract question-and-answer pairs from text samples. |
| fix_unicode_mapper | Unicode Corrector | Mapper | Mapper to fix Unicode errors in text samples. |
| nlpaug_en_mapper | English Augment | Mapper | Mapper to augment English samples using the nlpaug library. |
| nlpcda_zh_mapper | Chinese Augment | Mapper | Mapper to augment Chinese samples using the nlpcda library. |
| optimize_instruction_mapper | Instruction Optimizer | Mapper | Mapper to optimize instructions. |
| punctuation_normalization_mapper | Unicode Punctuations Normalizor | Mapper | Mapper to normalize Unicode punctuation to English punctuation in text samples. |
| remove_bibliography_mapper | Bibliography Cleaner | Mapper | Mapper to remove the bibliography at the end of documents in LaTeX samples. |
| remove_comments_mapper | Comments Cleaner | Mapper | Mapper to remove comments in different kinds of documents. Only 'tex' is supported for now. |
| remove_header_mapper | Remove Header | Mapper | Mapper to remove headers at the beginning of documents in LaTeX samples. |
| remove_long_words_mapper | Long Words Cleaner | Mapper | Mapper to remove long words within a specific range. |
| remove_non_chinese_character_mapper | Non Chinese Cleaner | Mapper | Mapper to remove non-Chinese characters in text samples. |
| remove_repeat_sentences_mapper | Sentence De-duplication | Mapper | Mapper to remove repeated sentences in text samples. |
| remove_specific_chars_mapper | Specific Chars Cleaner | Mapper | Mapper to clean specific characters in text samples. Currently supported: ◆●■►▼▲▴∆▻▷❖♡□ |
| remove_table_text_mapper | Table Texts Cleaner | Mapper | Mapper to remove table text from text samples. A regular expression removes tables whose column count falls within a specified range. |
| remove_words_with_incorrect_substrings_mapper | Incorrect Substring Cleaner | Mapper | Mapper to remove words with incorrect substrings. |
| replace_content_mapper | Content Replacement | Mapper | Mapper to replace all content in the text that matches a specific regular expression pattern with a designated replacement string. |
| sentence_split_mapper | Sentence Spliter | Mapper | Mapper to split text samples into sentences. |
| whitespace_normalization_mapper | Whitespace Normalizor | Mapper | Mapper to normalize different kinds of whitespace to the space character ' ' (0x20) in text samples. |
| alphanumeric_filter | Alphabet/Numeric Ratio Filter | Filter | Filter to keep samples with alphabet/numeric ratio within a specific range. |
| average_line_length_filter | Average Line Length Filter | Filter | Filter to keep samples with average line length within a specific range. |
| character_repetition_filter | Char-Level Repetition Ratio Filter | Filter | Filter to keep samples with char-level n-gram repetition ratio within a specific range. |
| flagged_words_filter | Flagged-Word Ratio Filter | Filter | Filter to keep samples with flagged-word ratio less than a specific max value. |
| language_id_score_filter | Specific Language Filter | Filter | Filter to keep samples in a specific language with confidence score larger than a specific min value. |
| maximum_line_length_filter | Maximum Line Length Filter | Filter | Filter to keep samples with maximum line length within a specific range. |
| perplexity_filter | Perplexity Score Filter | Filter | Filter to keep samples with perplexity score less than a specific max value. |
| special_characters_filter | Special-Char Ratio Filter | Filter | Filter to keep samples with special-char ratio within a specific range. |
| specified_field_filter | Specified Field Information Filter | Filter | Filter based on specified field information. Samples whose specified field value is not among the specified target values are filtered out. |
| specified_numeric_field_filter | Specified Numeric Field Filter | Filter | Filter based on specified numeric field information. Samples whose specified numeric field value is outside the specified range are filtered out. |
| stopwords_filter | Stopword Ratio Filter | Filter | Filter to keep samples with stopword ratio larger than a specific min value. |
| suffix_filter | Specified Suffix Filter | Filter | Filter to keep samples with a specified suffix. |
| text_action_filter | Texts Contain Actions Filter | Filter | Filter to keep texts that contain actions. |
| text_entity_dependency_filter | Texts Containing Entities Filter | Filter | Identifies entities in the text that are independent of other tokens and filters them. Texts containing no entities are omitted. |
| text_length_filter | Total Text Length Filter | Filter | Filter to keep samples with total text length within a specific range. |
| token_num_filter | Total Token Number Filter | Filter | Filter to keep samples with total token number within a specific range. |
| word_repetition_filter | Word-Level Repetition Ratio Filter | Filter | Filter to keep samples with word-level n-gram repetition ratio within a specific range. |
| words_num_filter | Total Words Number Filter | Filter | Filter to keep samples with total word count within a specific range. |
| document_deduplicator | Document Deduplicator(MD5 Hash) | Deduplicator | Deduplicator to deduplicate samples at document level using exact matching, based on MD5 hashes. |
| document_minhash_deduplicator | Document Deduplicator(MinHashLSH) | Deduplicator | Deduplicator to deduplicate samples at document level using MinHashLSH. Unlike SimHash, MinHash values are stored as bytes, so they are not kept in the final dataset. |
| document_simhash_deduplicator | Document Deduplicator(SimHash) | Deduplicator | Deduplicator to deduplicate samples at document level using SimHash. |
| frequency_specified_field_selector | Sorted Frequency Selector | Selector | Selector to select samples based on the sorted frequency of a specified field. |
| random_selector | Random Selector | Selector | Selector to randomly select samples. |
| range_specified_field_selector | Sorted Range Selector | Selector | Selector to select a range of samples based on the specified field value, sorted from smallest to largest. |
| topk_specified_field_selector | Top Samples Selector | Selector | Selector to select top samples based on the sorted specified field value. |
| annotate_edu_train_bert_scorer_mapper | Educational Evaluation Scoring | Mapper | Uses the Qwen2.5-14B model to evaluate the educational value of selected text and assign a score from 0 to 5. |
| text_high_score_filter | High-Score Data Filtering | Filter | Filters out data with scores greater than 3. |
| text_bloom_filter | Bloom Filter Deduplication | Filter | Removes duplicates in the dataset using a Bloom filter. |
| make_cosmopedia_mapper | Stylized Data Synthesis | Mapper | Reads seed data, treats each seed as a topic, and uses a prompt template to specify style and genre. The data is then passed to a vLLM model to generate synthetic data in the specified style. |
| pipeline_magpie_zh_mapper | Similarity-Based Deduplication | Mapper | Uses the DeepSeek-v2.5 or Qwen2.5 model with manually designed system prompts for multiple tasks to generate multi-turn dialogue data. |
| gather_generated_data_filter | Data Aggregation and Cleaning | Filter | Aggregates the generated data from the previous step, performs initial cleaning, and saves it as a file. |
| encode_and_get_nearest_mapper | Sample Encoding and Nearest Search | Mapper | Uses the gte-large-zh model to encode the first user_query in each dialogue into an embedding vector, then searches for the nearest sample based on cosine similarity. |
| dedup_and_save_deduplicator_deduplicator | Multi-Turn Dialogue Deduplication | Deduplicator | Deduplicates results by keeping only one random entry from each similarity group, and saves them into different files by task type. |
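
Judging from the descriptions in the table, the four operator types play different roles: a Mapper rewrites each sample, a Filter keeps or drops each sample, a Deduplicator removes repeated samples, and a Selector picks a subset. The following is a minimal plain-Python sketch of those four roles, loosely modelled on `whitespace_normalization_mapper`, `text_length_filter`, `document_deduplicator`, and `topk_specified_field_selector`. The function names, the `text` and `score` fields, and all parameters are assumptions for illustration, and the whitespace handling is simplified; this is not the platform's implementation.

```python
import hashlib
import re

Sample = dict  # each sample is assumed to be a dict with a "text" field


def normalize_whitespace(sample: Sample) -> Sample:
    """Mapper-style step: return an edited copy of the sample.

    Simplification: collapses each run of whitespace into one space.
    """
    return {**sample, "text": re.sub(r"\s+", " ", sample["text"]).strip()}


def text_length_in_range(sample: Sample, min_len: int, max_len: int) -> bool:
    """Filter-style step: True means the sample is kept."""
    return min_len <= len(sample["text"]) <= max_len


def deduplicate_exact(samples: list[Sample]) -> list[Sample]:
    """Deduplicator-style step: keep the first sample per MD5 digest of its text."""
    seen: set[str] = set()
    unique = []
    for sample in samples:
        digest = hashlib.md5(sample["text"].encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(sample)
    return unique


def select_top_k(samples: list[Sample], field_name: str, k: int) -> list[Sample]:
    """Selector-style step: keep the k samples with the largest field value."""
    return sorted(samples, key=lambda s: s[field_name], reverse=True)[:k]


def run_pipeline(samples: list[Sample]) -> list[Sample]:
    """Apply the four steps in a fixed order, as a template would."""
    samples = [normalize_whitespace(s) for s in samples]
    samples = [s for s in samples if text_length_in_range(s, min_len=5, max_len=1000)]
    samples = deduplicate_exact(samples)
    return select_top_k(samples, field_name="score", k=2)


raw = [
    {"text": "Hello   world", "score": 0.9},
    {"text": "Hello world", "score": 0.4},            # duplicate once normalized
    {"text": "Hi", "score": 0.8},                     # too short, dropped by the filter
    {"text": "Another  sample\ttext", "score": 0.7},
    {"text": "A third sample", "score": 0.1},
]
result = run_pipeline(raw)
print([s["text"] for s in result])  # ['Hello world', 'Another sample text']
```

The order of the steps matters in this sketch: because whitespace is normalized before deduplication, the first two samples hash to the same MD5 digest and only one of them survives.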