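Python raindrop spectrum (DSD) time processing: keep only precipitation sequences with 12 consecutive valid minutes and a total duration > 60 minutes

I have a pile of txt files and want to keep only the precipitation sequences that have 12 consecutive valid minutes and a total duration > 60 minutes. This filtering criterion is meant to ensure that the raindrop spectrum data used for training and testing are continuous and physically plausible. In detail: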

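1. What "12 consecutive valid minutes" means

Definition of "valid":
A 1-minute raindrop spectrum observation is valid when it meets the following quality-control conditions (Section 2.3 of the paper):
- Raindrop diameter
- At least 2 particles in each diameter bin, and at least 10 particles in total per minute.
- The raindrop fall velocity differs from the theoretical value by no more than 5 m/s.
The "12 consecutive minutes" requirement:
- There must be at least one stretch of 12 consecutive minutes in which every minute is valid (no missing or invalid data).
- Example: a precipitation event lasts 100 minutes, but only minutes 20–31 (12 minutes) are fully valid and the remaining minutes contain invalid data. The sequence is kept.
Purpose:
- Keep the time series coherent, which suits LSTM training (the model needs consecutive historical inputs to predict the future).
- Prevent gaps in the data from teaching the model unrealistic abrupt changes.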
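
The "12 consecutive minutes" requirement can be sketched as a search for runs in a per-minute validity list. This is only a minimal illustration: the function name `valid_runs` and the boolean `flags` list are assumptions, and the quality-control checks that produce each flag are not shown.

```python
from itertools import groupby


def valid_runs(flags):
    """Return (start_minute, length) for every run of consecutive valid minutes.

    `flags` holds one boolean per minute; minute numbering starts at 1.
    """
    runs = []
    minute = 1
    for is_valid, group in groupby(flags):
        length = len(list(group))
        if is_valid:
            runs.append((minute, length))
        minute += length
    return runs


# The 100-minute example above: only minutes 20-31 are valid.
flags = [20 <= m <= 31 for m in range(1, 101)]
runs = valid_runs(flags)
has_12_min_run = any(length >= 12 for _, length in runs)
print(runs, has_12_min_run)  # [(20, 12)] True
```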

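2. What "total duration > 60 minutes" means

"Total duration": the total observed length of a single precipitation event (for example, one rainfall episode), counting both valid and invalid minutes.
Requirement: the total observed duration is > 60 minutes, and the event contains at least one stretch of 12 consecutive valid minutes.
Purpose:
- Select precipitation events that are long enough, and avoid interference from short-lived precipitation (such as showers).
- Guarantee enough data (the paper ends up with 6,725 sequences selected from 1,788,915 minutes of data).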

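3. A worked filtering example

Suppose the per-minute data of a precipitation event look like this (✔ = valid, ✖ = invalid):

Time (minute): 1 2 3 4 5 6 7 8 9 10 11 12 13 ... 60 61
Validity: ✖ ✖ ✔ ✔ ✔ ✔ ✔ ✔ ✔ ✔ ✔ ✔ ✔ ... ✔ ✖

Analysis:
- Consecutive valid stretch: minutes 3–14 alone already give 12 consecutive valid minutes.
- Total duration: 61 minutes (> 60 minutes).
- Result: the sequence is kept.

If a precipitation event lasts only 50 minutes in total (< 60 minutes), it is discarded even when it contains 12 consecutive valid minutes.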
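
The two conditions can be checked together on the worked example. The sketch below is illustrative only: `keep_event` is a hypothetical helper, and it assumes the minutes elided by "..." in the table are all valid.

```python
def keep_event(flags, min_run=12, min_total=60):
    """Return True if the event is longer than `min_total` minutes in total
    and contains at least `min_run` consecutive valid minutes.

    `flags` holds one boolean per observed minute of the event.
    """
    total_minutes = len(flags)
    run = best = 0
    for ok in flags:
        run = run + 1 if ok else 0  # an invalid minute resets the run
        best = max(best, run)
    return total_minutes > min_total and best >= min_run


# Worked example: minutes 1-2 invalid, 3-60 valid (assumed), 61 invalid.
example = [False, False] + [True] * 58 + [False]
print(keep_event(example))      # True: 61 minutes in total, valid run of 58

# A 50-minute event is dropped even though every minute is valid.
print(keep_event([True] * 50))  # False
```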

Code:

#!/usr/bin/env python
# -*- coding:utf-8 -*-
"""
@author: Suyue
@file: lianxi.py
@time: 2025/04/28
@desc: Keep sequences that contain at least 12 consecutive minutes of precipitation and last at least 60 minutes in total
"""
import os
import glob
from datetime import datetime, timedelta


def process_raindrop_data(file_pattern, output_dir):
    """
    Process raindrop spectrum data and keep the sequences that meet the criteria.

    Args:
        file_pattern: file path pattern (e.g. "./data/*.txt")
        output_dir: output directory path
    """
    try:
        # Make sure the output directory exists and is writable
        os.makedirs(output_dir, exist_ok=True)

        # Probe write permission by creating and removing a scratch file
        test_file = os.path.join(output_dir, 'permission_test.txt')
        with open(test_file, 'w') as f:
            f.write('test')
        os.remove(test_file)

        # Collect all TXT files
        txt_files = glob.glob(file_pattern)
        if not txt_files:
            raise FileNotFoundError(f"No files match: {file_pattern}")
        print(f"Found {len(txt_files)} data files")

        all_valid_sequences = []

        for file_path in txt_files:
            try:
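                # Skip files that cannot be read
                if not os.access(file_path, os.R_OK):
                    print(f"Warning: no read permission, skipping file: {file_path}")
                    continue

                with open(file_path, 'r', encoding='utf-8') as file:
                    lines = [line.strip() for line in file.readlines() if line.strip()]

                # Extract (timestamp, rain water content) pairs: a timestamp line followed by a value line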
                data_pairs = []
                for i in range(0, len(lines), 2):
                    if i + 1 >= len(lines):
                        continue
                    timestamp_str = lines[i]
                    try:
                        timestamp = datetime.strptime(timestamp_str, "%Y-%m-%d %H:%M:%S")
                        value = float(lines[i + 1])
                        data_pairs.append((timestamp, value))
                    except (ValueError, IndexError) as e:
                        print(f"File {file_path}: malformed data at non-empty line {i + 1}: {e}")
                        continue

                # Find runs of at least 12 consecutive minutes
                sequences = []
                current_sequence = []

                for i in range(len(data_pairs)):
                    timestamp, value = data_pairs[i]

                    if not current_sequence:
                        current_sequence.append((timestamp, value))
                    else:
                        # Check that the timestamps are consecutive (1-minute step)
                        prev_time = current_sequence[-1][0]
                        if (timestamp - prev_time) == timedelta(minutes=1):
                            current_sequence.append((timestamp, value))
                        else:
                            # Gap in time: close the current run
                            if len(current_sequence) >= 12:
                                sequences.append(current_sequence)
                            current_sequence = [(timestamp, value)]

                # Check the final run
                if len(current_sequence) >= 12:
                    sequences.append(current_sequence)

                # Keep runs lasting at least 60 minutes
                valid_sequences = []
                for seq in sequences:
                    duration = (seq[-1][0] - seq[0][0]).total_seconds() / 60 + 1
                    if duration >= 60:
                        valid_sequences.append(seq)

                # Write the results to a file
                if valid_sequences:
                    base_name = os.path.basename(file_path)
                    output_path = os.path.join(output_dir, f"filtered_{base_name}")

                    try:
                        with open(output_path, 'w', encoding='utf-8') as out_file:
                            for seq in valid_sequences:
                                out_file.write(
                                    f"Sequence from {seq[0][0]} to {seq[-1][0]}, Duration: {len(seq)} minutes\n")
                                for timestamp, value in seq:
                                    out_file.write(f"{timestamp}\t{value}\n")
                                out_file.write("\n")

                        all_valid_sequences.extend(valid_sequences)
                    except IOError as e:
                        print(f"Cannot write output file {output_path}: {e}")

            except Exception as e:
                print(f"Error while processing file {file_path}: {e}")
                continue

    except Exception as e:
        print(f"Initialization error: {e}")
        return []

    print(f"Done: found {len(all_valid_sequences)} valid sequences")
    return all_valid_sequences


if __name__ == "__main__":
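    # Use absolute paths
    input_pattern = "F:/lianxi/*.txt"  # change to your actual data path
    output_directory = "F:/lianxi/filtered_sequences"  # change to your desired output path

    # Check that the input pattern matches at least one file
    if not glob.glob(input_pattern):
        print(f"Error: input path does not exist or no files match: {input_pattern}")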
    else:
        valid_sequences = process_raindrop_data(input_pattern, output_directory)

For the record, the results:

Found 8084 data files
Done: found 5316 valid sequences
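
Note that the script applies the 60-minute threshold to each gap-free run (with `>= 60`), while sections 1 and 2 describe an event-level rule: total duration > 60 minutes plus at least one 12-minute valid run. A possible event-level variant is sketched below. It rests on assumptions that are not in the original: the timestamps are the valid minutes, sorted in ascending order, and an event ends when the gap between valid minutes exceeds a placeholder `max_gap` of 30 minutes. The names `split_events`, `longest_consecutive_run` and `keep_event` are hypothetical.

```python
from datetime import datetime, timedelta

ONE_MINUTE = timedelta(minutes=1)


def split_events(timestamps, max_gap=timedelta(minutes=30)):
    """Group sorted timestamps into events; a gap above `max_gap` starts a new event.

    `max_gap` is a placeholder threshold, not a value taken from the paper.
    """
    events = []
    for ts in timestamps:
        if events and ts - events[-1][-1] <= max_gap:
            events[-1].append(ts)
        else:
            events.append([ts])
    return events


def longest_consecutive_run(timestamps):
    """Length of the longest run of timestamps spaced exactly one minute apart."""
    best = run = 0
    prev = None
    for ts in timestamps:
        run = run + 1 if prev is not None and ts - prev == ONE_MINUTE else 1
        best = max(best, run)
        prev = ts
    return best


def keep_event(event, min_run=12, min_total=60):
    """Event-level rule: total span > min_total minutes and a valid run >= min_run."""
    total_minutes = (event[-1] - event[0]) / ONE_MINUTE + 1
    return total_minutes > min_total and longest_consecutive_run(event) >= min_run


# Valid minutes 0-19 and 40-70 of one rainfall (a 20-minute stretch of invalid data between).
start = datetime(2025, 4, 28, 10, 0)
timestamps = [start + i * ONE_MINUTE for i in list(range(20)) + list(range(40, 71))]
events = split_events(timestamps)
print(len(events), keep_event(events[0]))  # 1 True
```

In this example the event spans 71 minutes and contains a 31-minute valid run, so the event-level rule keeps it, whereas the run-level filter in the script would drop both runs (20 and 31 minutes) for being shorter than 60 minutes.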

posted @ 2025-04-28 10:19 秋刀鱼CCC
