[note] MarkItDown: A tool for smoothly converting docx and pptx into markdown

How does MarkItDown work?

  • Below are the modules that MarkItDown depends on:
dependencies = [
"beautifulsoup4",
"requests",
"mammoth",
"markdownify",
"numpy",
"python-pptx",
"pandas",
"openpyxl",
"pdfminer.six",
"puremagic",
"pydub",
"youtube-transcript-api",
"SpeechRecognition",
"pathvalidate",
]

mammoth

python-pptx

openpyxl

pdfminer

pydub

youtube-transcript-api

SpeechRecognition

markdownify

pathvalidate

puremagic

  • https://github.com/cdgriffith/puremagic/tree/master
  • puremagic is a file type detection library that identifies an input file from its content rather than from its extension. Because an extension can be changed freely, relying on it alone to identify a file is risky. The module defines a set of rules that it checks against the file content to determine the type.
# puremagic/magic_data.json
{
"extension_only": [
["", 0, ".txt", "text/plain", "Text File"],
["", 0, ".log", "text/plain", "Logger File"],
["", 0, ".yaml", "application/x-yaml", "YAML File"],
["", 0, ".yml", "application/x-yaml", "YAML File"],
["", 0, ".toml", "application/toml", "TOML File"],
["", 0, ".py", "text/x-python", "Python File"],
["", 0, ".pyc", "application/x-python", "Python Complied File"],
["", 0, ".pyd", "application/x-python", "Python Complied File"],
["", 0, ".python_history", "text/plain", "Python History File"],
["", 0, ".bat", "application/x-script", "Windows BAT file"],
["", 0, ".gitconfig", "text/plain", "Git Ignore File"],
...
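
To make the idea of content-based detection concrete, here is a minimal sketch in plain Python. It does not use puremagic and is not how puremagic is implemented. The `SIGNATURES` table, the `EXTENSION_ONLY` mapping, and the `guess_type` function are hypothetical names made up for this illustration, and the table holds only a few well-known signatures. The sketch checks the leading bytes first and falls back to the extension only when no signature matches, which is roughly the role the `extension_only` entries above seem to play.

```python
from pathlib import Path
from typing import Optional

# Hypothetical, hand-picked table: (magic bytes, offset, extension, MIME type).
# A real library such as puremagic ships a far larger rule set.
SIGNATURES = [
    (b"%PDF", 0, ".pdf", "application/pdf"),
    (b"\x89PNG\r\n\x1a\n", 0, ".png", "image/png"),
    # docx, pptx and xlsx are ZIP containers, so they also start with these bytes.
    (b"PK\x03\x04", 0, ".zip", "application/zip"),
]

# Formats with no reliable signature: only the extension can hint at the type.
EXTENSION_ONLY = {
    ".txt": "text/plain",
    ".yaml": "application/x-yaml",
    ".toml": "application/toml",
}


def guess_type(data: bytes, filename: str = "") -> Optional[str]:
    """Return a MIME type guessed from content, else from the extension."""
    # 1. Trust the content: compare the bytes at each offset with the signature.
    for magic, offset, _ext, mime in SIGNATURES:
        if data[offset:offset + len(magic)] == magic:
            return mime
    # 2. No signature matched: fall back to the (less reliable) extension.
    return EXTENSION_ONLY.get(Path(filename).suffix.lower())


if __name__ == "__main__":
    # A PDF renamed to .txt is still recognised as a PDF.
    print(guess_type(b"%PDF-1.7\n...", "report.txt"))
    # Plain text has no signature, so the extension decides.
    print(guess_type(b"hello world", "notes.txt"))
```

Checking content before the extension is the design choice that matters here: a renamed file is still routed to the right converter, and the extension is used only as a last resort.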

https://hsiangjenli.github.io/blog/note_markitdown/

Author

Hsiang-Jen Li

Posted on

2024-12-13

Updated on

2024-12-18

Licensed under