# Create ATT&CK Groups Source Knowledge
---
* Collaborators:
    * Roberto Rodriguez (@Cyb3rWard0g)
* References:
    * https://python.langchain.com/en/latest/modules/indexes/getting_started.html
    * https://www.youtube.com/watch?v=eqOfr4AGLk8

## Import Modules

In [1]:
from attackcti import attack_client
import os
import logging

logging.getLogger('taxii2client').setLevel(logging.CRITICAL)

## Define Initial Variables

In [2]:
# Define a few variables
current_directory = os.path.dirname("__file__")
documents_directory = os.path.join(current_directory, "documents")
contrib_directory = os.path.join(current_directory, "contrib")
embeddings_directory = os.path.join(current_directory, "embeddings")
templates_directory = os.path.join(current_directory, "templates")
group_template = os.path.join(templates_directory, "group.md")
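
Note that `os.path.dirname("__file__")` receives the literal string `"__file__"`, not the module attribute. It therefore returns an empty string, and every path defined above is relative to the current working directory. This is presumably a workaround for notebooks, where `__file__` is not defined. A minimal check of that behavior:

```python
import os

# dirname of a bare name with no directory component is the empty string
current_directory = os.path.dirname("__file__")
documents_directory = os.path.join(current_directory, "documents")

print(repr(current_directory))  # ''
print(documents_directory)      # documents
```

Running the notebook from a different working directory will therefore read and write `documents`, `contrib`, and `templates` in that directory.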

## Initialize ATT&CK Client

In [3]:
lift = attack_client()

## Get ATT&CK Groups
Get the technique STIX objects used by all groups across all ATT&CK matrices.

In [4]:
techniques_used_by_groups = lift.get_techniques_used_by_all_groups()
techniques_used_by_groups[0]

{'type': 'intrusion-set',
 'id': 'intrusion-set--b7f627e2-0817-4cd5-8d50-e75f8aa85cc6',
 'created_by_ref': 'identity--c78cb6e5-0c4b-4611-8297-d1b8b55e40b5',
 'created': '2023-02-23T15:31:38.829Z',
 'modified': '2023-04-17T21:49:16.371Z',
 'name': 'LuminousMoth',
 'description': '[LuminousMoth](https://attack.mitre.org/groups/G1014) is a Chinese-speaking cyber espionage group that has been active since at least October 2020. [LuminousMoth](https://attack.mitre.org/groups/G1014) has targeted high-profile organizations, including government entities, in Myanmar, the Philippines, Thailand, and other parts of Southeast Asia. Some security researchers have concluded there is a connection between [LuminousMoth](https://attack.mitre.org/groups/G1014) and [Mustang Panda](https://attack.mitre.org/groups/G0129) based on similar targeting and TTPs, as well as network infrastructure overlaps.(Citation: Kaspersky LuminousMoth July 2021)(Citation: Bitdefender LuminousMoth July 2021)',
 'aliases': ['L

## Create ATT&CK Groups Documents

In [5]:
import copy
from jinja2 import Template

# Create Group docs
all_groups = dict()
for technique in techniques_used_by_groups:
    if technique['id'] not in all_groups:
        group = dict()
        group['group_name'] = technique['name']
        group['group_id'] = technique['external_references'][0]['external_id']
        group['created'] = technique['created']
        group['modified'] = technique['modified']
        group['description'] = technique['description']
        group['aliases'] = technique['aliases']
        if 'x_mitre_contributors' in technique:
            group['contributors'] = technique['x_mitre_contributors']
        group['techniques'] = []
        all_groups[technique['id']] = group
    technique_used = dict()
    technique_used['matrix'] = technique['matrix']
    technique_used['domain'] = technique['x_mitre_domains']
    technique_used['platform'] = technique['platform']
    technique_used['tactics'] = technique['tactic']
    technique_used['technique_id'] = technique['technique_id']
    technique_used['technique_name'] = technique['technique']
    technique_used['use'] = technique['relationship_description']
    if 'data_sources' in technique:
        technique_used['data_sources'] = technique['data_sources']
    all_groups[technique['id']]['techniques'].append(technique_used)

if not os.path.exists(documents_directory):
    print("[+] Creating knowledge directory..")
    os.makedirs(documents_directory)

print("[+] Creating markdown files for each group..")
# Read the Jinja2 template once and close the file handle
with open(group_template, encoding='utf-8') as template_file:
    markdown_template = Template(template_file.read())
for key in list(all_groups.keys()):
    group = all_groups[key]
    print("  [>>] Creating markdown file for {}..".format(group['group_name']))
    group_for_render = copy.deepcopy(group)
    markdown = markdown_template.render(metadata=group_for_render, group_name=group['group_name'], group_id=group['group_id'])
    file_name = (group['group_name']).replace(' ','_')
    # Write each group to its own markdown file
    with open(f'{documents_directory}/{file_name}.md', encoding='utf-8', mode='w') as markdown_file:
        markdown_file.write(markdown)

[+] Creating markdown files for each group..
  [>>] Creating markdown file for LuminousMoth..
  [>>] Creating markdown file for Metador..
  [>>] Creating markdown file for CURIUM..
  [>>] Creating markdown file for EXOTIC LILY..
  [>>] Creating markdown file for Moses Staff..
  [>>] Creating markdown file for SideCopy..
  [>>] Creating markdown file for Aoqin Dragon..
  [>>] Creating markdown file for Earth Lusca..
  [>>] Creating markdown file for POLONIUM..
  [>>] Creating markdown file for LAPSUS$..
  [>>] Creating markdown file for Ember Bear..
  [>>] Creating markdown file for BITTER..
  [>>] Creating markdown file for Aquatic Panda..
  [>>] Creating markdown file for Confucius..
  [>>] Creating markdown file for LazyScripter..
  [>>] Creating markdown file for TeamTNT..
  [>>] Creating markdown file for Andariel..
  [>>] Creating markdown file for Ferocious Kitten..
  [>>] Creating markdown file for IndigoZebra..
  [>>] Creating markdown file for BackdoorDiplomacy..
  [>>] Creat
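
The loop above folds a flat list of records (one per group and technique pair) into one dictionary per group. A minimal sketch of the same pattern follows. The toy records are hypothetical and keep only a few of the fields:

```python
# Hypothetical, abbreviated records: one entry per (group, technique) pair
records = [
    {"id": "intrusion-set--a", "name": "Group A",
     "technique_id": "T1204.002", "technique": "Malicious File"},
    {"id": "intrusion-set--a", "name": "Group A",
     "technique_id": "T1203", "technique": "Exploitation for Client Execution"},
    {"id": "intrusion-set--b", "name": "Group B",
     "technique_id": "T1566.003", "technique": "Spearphishing via Service"},
]

groups = {}
for record in records:
    # Create the group entry the first time its id is seen
    group = groups.setdefault(
        record["id"], {"group_name": record["name"], "techniques": []}
    )
    # Every record contributes exactly one technique to its group
    group["techniques"].append(
        {"technique_id": record["technique_id"], "technique_name": record["technique"]}
    )

for group in groups.values():
    print(group["group_name"], [t["technique_id"] for t in group["techniques"]])
```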

## Index Source Knowledge

### Load Documents

In [6]:
import glob
from langchain.document_loaders import UnstructuredMarkdownLoader

In [7]:
# variables
group_files = glob.glob(os.path.join(documents_directory, "*.md"))

# Loading Markdown files
md_docs = []
print("[+] Loading Group markdown files..")
for group in group_files:
    print(f' [*] Loading {os.path.basename(group)}')
    loader = UnstructuredMarkdownLoader(group)
    md_docs.extend(loader.load())

print(f'[+] Number of .md documents processed: {len(md_docs)}')

[+] Loading Group markdown files..
 [*] Loading admin@338.md
 [*] Loading Ajax_Security_Team.md
 [*] Loading ALLANITE.md
 [*] Loading Andariel.md
 [*] Loading Aoqin_Dragon.md
 [*] Loading APT-C-36.md
 [*] Loading APT1.md
 [*] Loading APT12.md
 [*] Loading APT16.md
 [*] Loading APT17.md
 [*] Loading APT18.md
 [*] Loading APT19.md
 [*] Loading APT28.md
 [*] Loading APT29.md
 [*] Loading APT3.md
 [*] Loading APT30.md
 [*] Loading APT32.md
 [*] Loading APT33.md
 [*] Loading APT37.md
 [*] Loading APT38.md
 [*] Loading APT39.md
 [*] Loading APT41.md
 [*] Loading Aquatic_Panda.md
 [*] Loading Axiom.md
 [*] Loading BackdoorDiplomacy.md
 [*] Loading BITTER.md
 [*] Loading BlackOasis.md
 [*] Loading BlackTech.md
 [*] Loading Blue_Mockingbird.md
 [*] Loading BRONZE_BUTLER.md
 [*] Loading Carbanak.md
 [*] Loading Chimera.md
 [*] Loading Cleaver.md
 [*] Loading Cobalt_Group.md
 [*] Loading Confucius.md
 [*] Loading CopyKittens.md
 [*] Loading CURIUM.md
 [*] Loading Darkhotel.md
 [*] Loading DarkHyd

Check the page content of a document

In [8]:
print(md_docs[0].page_content)

admin@338 - G0018

Created: 2017-05-31T21:31:53.579Z

Modified: 2020-03-18T19:54:59.120Z

Contributors: Tatsuya Daitoku, Cyber Defense Institute, Inc.

Aliases

admin@338

Description

admin@338 is a China-based cyber threat group. It has previously used newsworthy events as lures to deliver malware and has primarily targeted organizations involved in financial, economic, and trade policy, typically using publicly available RATs such as PoisonIvy, as well as some non-public backdoors. (Citation: FireEye admin@338)

Techniques Used

admin@338 has sent emails with malicious Microsoft Office documents attached.(Citation: FireEye admin@338)|
|mitre-attack|enterprise-attack|Linux,macOS,Windows|T1204.002|Malicious File|

admin@338 has attempted to get victims to launch malicious Microsoft Word attachments delivered via spearphishing emails.(Citation: FireEye admin@338)|
|mitre-attack|enterprise-attack|Linux,Windows,macOS|T1203|Exploitation for Client Execution|

admin@338 has exploited clien

### Split Documents
Check token counts on loaded documents

In [9]:
import tiktoken

tokenizer = tiktoken.encoding_for_model('gpt-3.5-turbo')
token_integers = tokenizer.encode(md_docs[0].page_content, disallowed_special=())
num_tokens = len(token_integers)
token_bytes = [tokenizer.decode_single_token_bytes(token) for token in token_integers]

print(f"token count: {num_tokens} tokens")
print(f"token integers: {token_integers}")
print(f"token bytes: {token_bytes}")

token count: 532 tokens
token integers: [2953, 31, 18633, 482, 480, 4119, 23, 271, 11956, 25, 220, 679, 22, 12, 2304, 12, 2148, 51, 1691, 25, 2148, 25, 4331, 13, 24847, 57, 271, 19696, 25, 220, 2366, 15, 12, 2839, 12, 972, 51, 777, 25, 4370, 25, 2946, 13, 4364, 57, 271, 54084, 9663, 25, 350, 1900, 45644, 423, 1339, 16900, 11, 34711, 16777, 10181, 11, 4953, 382, 96309, 271, 2953, 31, 18633, 271, 5116, 271, 2953, 31, 18633, 374, 264, 5734, 6108, 21516, 6023, 1912, 13, 1102, 706, 8767, 1511, 502, 2332, 34594, 4455, 439, 326, 1439, 311, 6493, 40831, 323, 706, 15871, 17550, 11351, 6532, 304, 6020, 11, 7100, 11, 323, 6696, 4947, 11, 11383, 1701, 17880, 2561, 98980, 82, 1778, 439, 52212, 40, 14029, 11, 439, 1664, 439, 1063, 2536, 57571, 1203, 28404, 13, 320, 34, 7709, 25, 6785, 51158, 4074, 31, 18633, 696, 29356, 8467, 12477, 271, 2953, 31, 18633, 706, 3288, 14633, 449, 39270, 5210, 8410, 9477, 12673, 13127, 34, 7709, 25, 6785, 51158, 4074, 31, 18633, 8, 7511, 91, 1800, 265, 12, 21208, 91, 79

Create a length function to calculate the min, max, and average token counts across all documents

In [10]:
def tiktoken_len(text):
    tokens = tokenizer.encode(
        text,
        disallowed_special=()  # disable the check for all special tokens
    )
    return len(tokens)

# Get token counts
token_counts = [tiktoken_len(doc.page_content) for doc in md_docs]

print(f"""[+] Token Counts:
Min: {min(token_counts)}
Avg: {int(sum(token_counts) / len(token_counts))}
Max: {max(token_counts)}""")

[+] Token Counts:
Min: 176
Avg: 1619
Max: 7346


Use the LangChain text splitter

In [11]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

In [12]:
# Chunking Text
print('[+] Initializing RecursiveCharacterTextSplitter..')
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,
    chunk_overlap=50,  # number of tokens overlap between chunks
    length_function=tiktoken_len,
    separators=['\n\n', '\n', ' ', '']
)

[+] Initializing RecursiveCharacterTextSplitter..
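
To make the `chunk_size` and `chunk_overlap` parameters concrete, here is a simplified fixed-window chunker over a list of tokens. It is not how `RecursiveCharacterTextSplitter` works internally, since that splitter also respects the separators listed above. `window_chunks` is a hypothetical helper used only for illustration:

```python
def window_chunks(tokens, chunk_size, chunk_overlap):
    """Split tokens into windows of chunk_size that share chunk_overlap items."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        # Stop once the current window reaches the end of the input
        if start + chunk_size >= len(tokens):
            break
    return chunks

tokens = list(range(12))
example_chunks = window_chunks(tokens, chunk_size=5, chunk_overlap=2)
for chunk in example_chunks:
    print(chunk)
```

Each window starts `chunk_size - chunk_overlap` positions after the previous one, so consecutive chunks repeat the last `chunk_overlap` tokens of their predecessor.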


In [13]:
print('[+] Splitting documents in chunks..')
chunks = text_splitter.split_documents(md_docs)

print(f'[+] Number of documents: {len(md_docs)}')
print(f'[+] Number of chunks: {len(chunks)}')

[+] Splitting documents in chunks..
[+] Number of documents: 134
[+] Number of chunks: 534
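
As a rough sanity check on these numbers, a fixed-window split with `chunk_size=500` and `chunk_overlap=50` needs about `ceil((n - 50) / 450)` chunks for a document of `n` tokens. The sketch below applies that estimate to the token statistics reported earlier. `estimate_chunks` is a hypothetical helper, and the real splitter breaks on separators, so actual counts will differ somewhat:

```python
import math

def estimate_chunks(num_tokens, chunk_size=500, chunk_overlap=50):
    """Approximate chunk count for a fixed-window split with overlap."""
    if num_tokens <= chunk_size:
        return 1
    step = chunk_size - chunk_overlap
    return math.ceil((num_tokens - chunk_overlap) / step)

for label, count in [("min", 176), ("avg", 1619), ("max", 7346)]:
    print(f"{label}: {count} tokens -> ~{estimate_chunks(count)} chunks")
```

With 134 documents averaging 1619 tokens, the estimate is on the order of 134 × 4 = 536 chunks, which is in the same range as the 534 chunks reported above.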


In [14]:
print(chunks[0])
print(chunks[1])

page_content='admin@338 - G0018\n\nCreated: 2017-05-31T21:31:53.579Z\n\nModified: 2020-03-18T19:54:59.120Z\n\nContributors: Tatsuya Daitoku, Cyber Defense Institute, Inc.\n\nAliases\n\nadmin@338\n\nDescription\n\nadmin@338 is a China-based cyber threat group. It has previously used newsworthy events as lures to deliver malware and has primarily targeted organizations involved in financial, economic, and trade policy, typically using publicly available RATs such as PoisonIvy, as well as some non-public backdoors. (Citation: FireEye admin@338)\n\nTechniques Used\n\nadmin@338 has sent emails with malicious Microsoft Office documents attached.(Citation: FireEye admin@338)|\n|mitre-attack|enterprise-attack|Linux,macOS,Windows|T1204.002|Malicious File|\n\nadmin@338 has attempted to get victims to launch malicious Microsoft Word attachments delivered via spearphishing emails.(Citation: FireEye admin@338)|\n|mitre-attack|enterprise-attack|Linux,Windows,macOS|T1203|Exploitation for Client Execu

### Contribute Split Documents (Optional)
We can contribute this so that others can use the data generated so far.

In [15]:
import hashlib

json_documents = []
for doc in md_docs:
    doc_name = os.path.basename(doc.metadata['source'])
    # Hash each file name on its own so the id depends only on that name
    uid = hashlib.md5(doc_name.encode('utf-8')).hexdigest()[:12]
    chunks_strings = text_splitter.split_text(doc.page_content)
    for i, chunk in enumerate(chunks_strings):
        # Add JSON object to array
        json_documents.append({
            'id': f'{uid}-{i}',
            'text': chunk,
            'source': doc_name
        })

In [16]:
print(json_documents[0])
print(json_documents[1])

{'id': '4d1ab63e9fd8-0', 'text': 'admin@338 - G0018\n\nCreated: 2017-05-31T21:31:53.579Z\n\nModified: 2020-03-18T19:54:59.120Z\n\nContributors: Tatsuya Daitoku, Cyber Defense Institute, Inc.\n\nAliases\n\nadmin@338\n\nDescription\n\nadmin@338 is a China-based cyber threat group. It has previously used newsworthy events as lures to deliver malware and has primarily targeted organizations involved in financial, economic, and trade policy, typically using publicly available RATs such as PoisonIvy, as well as some non-public backdoors. (Citation: FireEye admin@338)\n\nTechniques Used\n\nadmin@338 has sent emails with malicious Microsoft Office documents attached.(Citation: FireEye admin@338)|\n|mitre-attack|enterprise-attack|Linux,macOS,Windows|T1204.002|Malicious File|\n\nadmin@338 has attempted to get victims to launch malicious Microsoft Word attachments delivered via spearphishing emails.(Citation: FireEye admin@338)|\n|mitre-attack|enterprise-attack|Linux,Windows,macOS|T1203|Exploitat
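
One detail worth noting when deriving ids from file names: a `hashlib` object accumulates every `update()` call, so a digest taken from a shared object depends on all names hashed before it. Creating a fresh hash per name keeps each id a function of that name alone. A small sketch follows. `doc_uid` is a hypothetical helper, and the file names are taken from the listing above:

```python
import hashlib

def doc_uid(doc_name, length=12):
    """Return a short, reproducible id derived only from doc_name."""
    return hashlib.md5(doc_name.encode("utf-8")).hexdigest()[:length]

names = ["admin@338.md", "APT1.md"]

# Fresh hash per name: independent of processing order
fresh_ids = {name: doc_uid(name) for name in names}

# Shared hash object: the second digest also covers the first name
shared = hashlib.md5()
shared_ids = {}
for name in names:
    shared.update(name.encode("utf-8"))
    shared_ids[name] = shared.hexdigest()[:12]

print(fresh_ids["admin@338.md"] == shared_ids["admin@338.md"])  # True: nothing was hashed before it
print(fresh_ids["APT1.md"] == shared_ids["APT1.md"])            # False: state was carried over
```

With per-name hashing, adding or removing a group file does not change the ids of the remaining documents.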

Export Knowledge Base as JSONL File (Optional)

In [17]:
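import json

print('[+] Exporting groups as .jsonl file..')
# Make sure the output directory exists before writing
os.makedirs(contrib_directory, exist_ok=True)
with open(os.path.join(contrib_directory, "attack-groups.jsonl"), 'w', encoding='utf-8') as f:
    for doc in json_documents:
        f.write(json.dumps(doc) + '\n')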

[+] Exporting groups as .jsonl file..
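
A consumer of the exported file can read it back one JSON object per line. A minimal round-trip sketch, using a temporary file and made-up records in place of the real `attack-groups.jsonl`:

```python
import json
import os
import tempfile

# Made-up records with the same keys as the exported documents
sample_documents = [
    {"id": "abc123-0", "text": "first chunk", "source": "Example_Group.md"},
    {"id": "abc123-1", "text": "second chunk", "source": "Example_Group.md"},
]

with tempfile.TemporaryDirectory() as tmp_dir:
    jsonl_path = os.path.join(tmp_dir, "example.jsonl")

    # Write one JSON object per line
    with open(jsonl_path, "w", encoding="utf-8") as f:
        for doc in sample_documents:
            f.write(json.dumps(doc) + "\n")

    # Read the file back, skipping blank lines
    with open(jsonl_path, encoding="utf-8") as f:
        loaded_documents = [json.loads(line) for line in f if line.strip()]

print(len(loaded_documents))
```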


### Generate Embeddings

In [18]:
from langchain.embeddings.sentence_transformer import SentenceTransformerEmbeddings
from langchain.vectorstores import Chroma

In [19]:
# create the open-source embedding function
embedding_function = SentenceTransformerEmbeddings(model_name="all-mpnet-base-v2")

# load it into Chroma and save it to disk
db = Chroma.from_documents(chunks, embedding_function, collection_name="groups_collection", persist_directory="./chroma_db")




Ask a question directly to the DB

In [23]:
# query it
query = "What threat actors send text messages to their targets?"
relevant_docs = db.similarity_search(query)

# print results
print(relevant_docs[0].page_content)

Lazarus Group has created new Twitter accounts to conduct social engineering against potential victims.(Citation: Google TAG Lazarus Jan 2021)|
|mitre-attack|enterprise-attack,ics-attack|Linux,macOS,Windows|T1566.003|Spearphishing via Service|

Lazarus Group has used social media platforms, including LinkedIn and Twitter, to send spearphishing messages.(Citation: Google TAG Lazarus Jan 2021)|
|mitre-attack|enterprise-attack,ics-attack|PRE|T1584.004|Server|

Lazarus Group has compromised servers to stage malicious tools.(Citation: Kaspersky ThreatNeedle Feb 2021)|
|mitre-attack|enterprise-attack,ics-attack|PRE|T1591|Gather Victim Org Information|

Lazarus Group has studied publicly available information about a targeted organization to tailor spearphishing efforts against specific departments and/or individuals.(Citation: Kaspersky ThreatNeedle Feb 2021)|
|mitre-attack|enterprise-attack,ics-attack|PRE|T1585.002|Email Accounts|

Lazarus Group has created new email accounts for spearphish
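
Conceptually, a similarity search embeds the query and returns the stored chunks whose vectors are closest to it. The sketch below shows that idea with hand-made 3-dimensional vectors and cosine similarity. The vectors and texts are invented, and the vector store's actual distance metric and index structure may differ:

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query_vector, store, k=2):
    """Return the k (text, score) pairs most similar to query_vector."""
    scored = [(text, cosine_similarity(query_vector, vector)) for text, vector in store]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]

# Invented chunks with toy embeddings
store = [
    ("chunk about spearphishing via social media", [0.9, 0.1, 0.0]),
    ("chunk about credential dumping", [0.0, 0.2, 0.9]),
    ("chunk about phishing attachments", [0.7, 0.6, 0.1]),
]
query_vector = [1.0, 0.2, 0.0]

results = top_k(query_vector, store)
for text, score in results:
    print(f"{score:.3f}  {text}")
```

Because ranking depends only on vector proximity, the top result can be topically related without literally answering the question, as with the social media spearphishing chunk returned for the text message query above.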