sn-da-large-file-analysis
Large Scale Excel Analysis Skill
Mandatory Rules
When total rows >= 10,000, you MUST use the methods in this skill.
| Data Scale | Read Strategy | Reason |
|---|---|---|
| < 10k rows | `pd.read_excel()` directly | No memory pressure |
| 10k–100k rows | `pd.read_excel()` → convert to Parquet → `pd.read_parquet()` for analysis | Avoid repeated slow reads |
| 100k–1M rows | openpyxl `read_only` + `iter_rows` streaming → Parquet | `pd.read_excel()` will OOM or time out |
| > 1M rows | Streaming read + multi-sheet split (Excel max 1,048,576 rows per sheet) | Must chunk |
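The thresholds in the table can be sketched as a small lookup, together with the arithmetic behind the per-sheet limit. This is only an illustration: the function names `choose_read_strategy` and `sheets_needed` are hypothetical, and the handling of the exact boundaries (100,000 rows goes to streaming, 1,000,000 rows stays in the single-sheet streaming tier) is an assumption inferred from the table and the prohibition on loading 100k+ row files with `pd.read_excel()`.

```python
import math

# Excel's hard limit on rows per worksheet (from the table above).
EXCEL_MAX_ROWS_PER_SHEET = 1_048_576


def choose_read_strategy(total_rows: int) -> str:
    """Map a row count to the read strategy named in the table.

    Boundary handling is an assumption: 10,000 and 100,000 rows each
    fall into the more conservative (higher) tier.
    """
    if total_rows < 10_000:
        return "pd.read_excel() directly"
    if total_rows < 100_000:
        return "pd.read_excel() -> Parquet -> pd.read_parquet()"
    if total_rows <= 1_000_000:
        return "openpyxl read_only + iter_rows streaming -> Parquet"
    return "streaming read + multi-sheet split"


def sheets_needed(data_rows: int, header_rows: int = 1) -> int:
    """Number of sheets required to hold `data_rows` rows of data.

    Assumes every sheet repeats `header_rows` header rows, which
    reduces the usable capacity of each sheet accordingly.
    """
    capacity = EXCEL_MAX_ROWS_PER_SHEET - header_rows
    return max(1, math.ceil(data_rows / capacity))


if __name__ == "__main__":
    for n in (5_000, 50_000, 500_000, 2_500_000):
        print(n, "->", choose_read_strategy(n), "| sheets:", sheets_needed(n))
```

With one header row per sheet, each sheet holds 1,048,575 data rows, so a 2,500,000-row dataset would need three sheets under these assumptions.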
Prohibited:
- Do NOT use `pd.read_excel()` to fully load 100k+ row files
- Do NOT search for fonts with `fc-list`, `find ... fonts`, or install packages with `pip install`
- Do NOT use `df.iterrows()` on large DataFrames (use `itertuples()` or vectorized ops)
- Do NOT use `df.apply(lambda...)` for operations that can be vectorized
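The last two prohibitions can be illustrated with a minimal sketch. The column names (`price`, `qty`, `total`, `tier`), the threshold of 50, and the helper names are made up for the example; they are not part of the skill.

```python
import numpy as np
import pandas as pd


def add_totals(df: pd.DataFrame) -> pd.DataFrame:
    """Vectorized replacement for a row-wise df.apply(lambda ...)."""
    out = df.copy()
    # Column arithmetic runs on whole arrays instead of one row at a time.
    out["total"] = out["price"] * out["qty"]
    # np.where is the vectorized form of an if/else inside a lambda.
    out["tier"] = np.where(out["total"] >= 50, "high", "low")
    return out


def row_labels(df: pd.DataFrame) -> list:
    """Row loop with itertuples(), for cases that cannot be vectorized.

    itertuples() yields lightweight namedtuples, whereas iterrows()
    builds a full Series object for every row.
    """
    return [f"{row.qty} x {row.price}" for row in df.itertuples(index=False)]


if __name__ == "__main__":
    sample = pd.DataFrame({"price": [10.0, 20.0, 30.0], "qty": [1, 2, 3]})
    print(add_totals(sample))
    print(row_labels(sample))
```

The vectorized path is the one to prefer; the `itertuples()` loop is a fallback for logic that truly needs per-row Python code.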