Data Analysis with Pandas - Transforming Data

Representing the JSON File as a DataFrame Using Pandas

Representing the JSON file as a Pandas DataFrame can involve commands such as wget and unzip. This was explained in more detail in the notebook titled Representing the JSON File as a DataFrame Using Pandas, which is located in the same folder as this notebook. To keep this notebook simple, the JSON files required for this workshop have already been downloaded and decompressed. These files are in the sets_datos folder.

! wget https://raw.githubusercontent.com/OTRF/mordor/master/datasets/small/windows/lateral_movement/host/empire_shell_dcerpc_smb_service_dll_hijack.zip -O sets_datos/empire_shell_dcerpc_smb_service_dll_hijack.zip

! unzip -o sets_datos/empire_shell_dcerpc_smb_service_dll_hijack.zip -d sets_datos/

dllhijack_json = 'sets_datos/empire_shell_dcerpc_smb_service_dll_hijack_2020-09-21232839.json'

a) Importing the Pandas library

import pandas as pd

b) Reading the JSON file

We will use the pandas.read_json function. Passing lines = True tells Pandas to parse the file as one JSON object per line.

Reference: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_json.html

df = pd.read_json(path_or_buf = dllhijack_json, lines = True)

df.head()
Keywords SeverityValue TargetObject EventTypeOrignal EventID ProviderGuid ExecutionProcessID host Channel UserID ... KeyType ClientProcessId AlgorithmName ReturnCode KeyName KeyFilePath MiniportNameLen MiniportName param4 param3
0 -9223372036854775808 2 HKU\.DEFAULT\Software\Microsoft\Office\16.0\Co... INFO 13 {5770385F-C22A-43E0-BF4C-06F5698FFBD9} 3172 wec.internal.cloudapp.net Microsoft-Windows-Sysmon/Operational S-1-5-18 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1 -9223372036854775808 2 NaN NaN 10 {5770385F-C22A-43E0-BF4C-06F5698FFBD9} 3392 wec.internal.cloudapp.net Microsoft-Windows-Sysmon/Operational S-1-5-18 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2 -9223372036854775808 2 NaN NaN 10 {5770385F-C22A-43E0-BF4C-06F5698FFBD9} 3392 wec.internal.cloudapp.net Microsoft-Windows-Sysmon/Operational S-1-5-18 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
3 -9214364837600034816 2 NaN NaN 5158 {54849625-5478-4994-A5BA-3E3B0328C30D} 4 wec.internal.cloudapp.net security NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
4 -9214364837600034816 2 NaN NaN 5156 {54849625-5478-4994-A5BA-3E3B0328C30D} 4 wec.internal.cloudapp.net security NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN

5 rows × 206 columns
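
As a minimal sketch of what lines = True does: each line of the input is parsed as an independent JSON object, and the resulting columns are the union of the keys seen across all lines, with missing values where a record lacks a key. This is consistent with the many NaN cells shown above. The three records below are invented for illustration and are not taken from the dataset.

```python
import io

import pandas as pd

# Hypothetical JSON Lines sample: one JSON object per line, with different keys per record
sample = io.StringIO(
    '{"EventID": 13, "Channel": "Microsoft-Windows-Sysmon/Operational", "TargetObject": "example-key"}\n'
    '{"EventID": 10, "Channel": "Microsoft-Windows-Sysmon/Operational"}\n'
    '{"EventID": 5158, "Channel": "security", "Application": "System"}\n'
)

sample_df = pd.read_json(sample, lines=True)

# Three records and four distinct keys in total
print(sample_df.shape)

# Number of records that did not carry a TargetObject key
print(sample_df['TargetObject'].isna().sum())
```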

c) Exploring the columns or attributes of the DataFrame

We will use the pandas.DataFrame.info method.

Reference: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.info.html

df.info(verbose = True)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6349 entries, 0 to 6348
Data columns (total 206 columns):
 #   Column                     Dtype  
---  ------                     -----  
 0   Keywords                   int64  
 1   SeverityValue              int64  
 2   TargetObject               object 
 3   EventTypeOrignal           object 
 4   EventID                    int64  
 5   ProviderGuid               object 
 6   ExecutionProcessID         int64  
 7   host                       object 
 8   Channel                    object 
 9   UserID                     object 
 10  AccountType                object 
 11  ThreadID                   int64  
 12  ProcessGuid                object 
 13  Details                    object 
 14  EventReceivedTime          object 
 15  Opcode                     object 
 16  EventTime                  object 
 17  @timestamp                 object 
 18  SourceModuleType           object 
 19  port                       int64  
 20  AccountName                object 
 21  RecordNumber               int64  
 22  Task                       int64  
 23  Domain                     object 
 24  @version                   int64  
 25  OpcodeValue                float64
 26  SourceModuleName           object 
 27  Severity                   object 
 28  SourceName                 object 
 29  Version                    float64
 30  Image                      object 
 31  Category                   object 
 32  UtcTime                    object 
 33  Hostname                   object 
 34  RuleName                   object 
 35  tags                       object 
 36  SourceImage                object 
 37  SourceProcessGUID          object 
 38  TargetImage                object 
 39  GrantedAccess              object 
 40  EventType                  object 
 41  SourceProcessId            object 
 42  SourceThreadId             float64
 43  TargetProcessGUID          object 
 44  TargetProcessId            object 
 45  CallTrace                  object 
 46  Application                object 
 47  ProcessId                  object 
 48  Message                    object 
 49  FilterRTID                 float64
 50  LayerRTID                  float64
 51  Protocol                   object 
 52  SourcePort                 float64
 53  LayerName                  object 
 54  SourceAddress              object 
 55  RemoteUserID               object 
 56  Direction                  object 
 57  DestPort                   float64
 58  DestAddress                object 
 59  RemoteMachineID            object 
 60  ActivityID                 object 
 61  Payload                    object 
 62  ERROR_EVT_UNRESOLVED       float64
 63  ContextInfo                object 
 64  ImageLoaded                object 
 65  Signed                     object 
 66  SignatureStatus            object 
 67  Hashes                     object 
 68  Description                object 
 69  Company                    object 
 70  FileVersion                object 
 71  Signature                  object 
 72  Product                    object 
 73  OriginalFileName           object 
 74  SubjectDomainName          object 
 75  SubjectUserSid             object 
 76  SubjectLogonId             object 
 77  TaskContentNew             object 
 78  SubjectUserName            object 
 79  TaskName                   object 
 80  ProcessName                object 
 81  Status                     object 
 82  RuleAttr                   object 
 83  RuleId                     object 
 84  ChangeType                 object 
 85  FilterKey                  object 
 86  FilterType                 object 
 87  FilterName                 object 
 88  Weight                     float64
 89  UserName                   object 
 90  LayerId                    float64
 91  Action                     object 
 92  CalloutKey                 object 
 93  CalloutName                object 
 94  FilterId                   float64
 95  UserSid                    object 
 96  ProviderName               object 
 97  LayerKey                   object 
 98  ProviderKey                object 
 99  Conditions                 object 
 100 PrivilegeList              object 
 101 TargetLogonId              object 
 102 LogonType                  float64
 103 VirtualAccount             object 
 104 LogonGuid                  object 
 105 AuthenticationPackageName  object 
 106 IpAddress                  object 
 107 TransmittedServices        object 
 108 LmPackageName              object 
 109 ImpersonationLevel         object 
 110 ElevatedToken              object 
 111 WorkstationName            object 
 112 TargetOutboundUserName     object 
 113 TargetOutboundDomainName   object 
 114 LogonProcessName           object 
 115 KeyLength                  float64
 116 TargetLinkedLogonId        object 
 117 RestrictedAdminMode        object 
 118 TargetUserName             object 
 119 IpPort                     object 
 120 TargetUserSid              object 
 121 TargetDomainName           object 
 122 EventIdx                   float64
 123 GroupMembership            object 
 124 EventCountTotal            float64
 125 TargetFilename             object 
 126 CreationUtcTime            object 
 127 SourceHandleId             object 
 128 TargetHandleId             object 
 129 ObjectServer               object 
 130 HandleId                   object 
 131 TransactionId              object 
 132 AccessMask                 object 
 133 ObjectName                 object 
 134 ObjectType                 object 
 135 AccessReason               object 
 136 AccessList                 object 
 137 RestrictedSidCount         float64
 138 ResourceAttributes         object 
 139 EnabledPrivilegeList       object 
 140 DisabledPrivilegeList      object 
 141 ShareName                  object 
 142 ShareLocalPath             object 
 143 RelativeTargetName         object 
 144 SourcePortName             object 
 145 DestinationPort            float64
 146 User                       object 
 147 SourceHostname             object 
 148 DestinationIp              object 
 149 SourceIp                   object 
 150 DestinationIsIpv6          object 
 151 Initiated                  object 
 152 SourceIsIpv6               object 
 153 DestinationPortName        object 
 154 DestinationHostname        object 
 155 ParentImage                object 
 156 CommandLine                object 
 157 CurrentDirectory           object 
 158 IntegrityLevel             object 
 159 TerminalSessionId          float64
 160 ParentProcessGuid          object 
 161 ParentCommandLine          object 
 162 ParentProcessId            float64
 163 LogonId                    object 
 164 Device                     object 
 165 NewSd                      object 
 166 OldSd                      object 
 167 MandatoryLabel             object 
 168 ParentProcessName          object 
 169 NewProcessName             object 
 170 TokenElevationType         object 
 171 NewProcessId               object 
 172 PipeName                   object 
 173 Properties                 object 
 174 OperationType              object 
 175 AdditionalInfo             object 
 176 Path                       object 
 177 Priority                   float64
 178 Service                    object 
 179 ServiceName                object 
 180 TicketEncryptionType       object 
 181 ServiceSid                 object 
 182 TicketOptions              object 
 183 QueryResults               object 
 184 QueryName                  object 
 185 QueryStatus                float64
 186 IsExecutable               object 
 187 Archived                   object 
 188 param1                     object 
 189 param2                     object 
 190 MessageNumber              float64
 191 ScriptBlockText            object 
 192 MessageTotal               float64
 193 ScriptBlockId              object 
 194 Operation                  object 
 195 ClientCreationTime         object 
 196 KeyType                    object 
 197 ClientProcessId            float64
 198 AlgorithmName              object 
 199 ReturnCode                 object 
 200 KeyName                    object 
 201 KeyFilePath                object 
 202 MiniportNameLen            float64
 203 MiniportName               object 
 204 param4                     object 
 205 param3                     object 
dtypes: float64(25), int64(9), object(172)
memory usage: 10.0+ MB
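
The dtype tally on the last lines of the info output can also be obtained programmatically, and counting non-null values per column helps explain why many numeric columns appear as float64: a column that is absent from some events is filled with NaN, which forces a floating-point dtype. A minimal sketch on an invented three-row frame (the values are hypothetical):

```python
import numpy as np
import pandas as pd

# Hypothetical frame that mimics the mix of always-present and sparse columns
toy = pd.DataFrame({
    'EventID': [1, 10, 5156],                      # present in every record
    'SourcePort': [np.nan, np.nan, 49152.0],       # missing in some records
    'CommandLine': ['cmd.exe /c whoami', None, None],
})

# Same kind of tally as the "dtypes:" line printed by info()
dtype_counts = toy.dtypes.astype(str).value_counts()

# Non-null values per column; sparse columns stand out immediately
non_null = toy.notna().sum()

print(dtype_counts)
print(non_null.sort_values())
```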

Filtering Security Events: Sysmon 1 (Process Creation)

We will use the same code as in the previous notebook, with one small modification. Instead of matching the full channel name, we will search for the word sysmon, ignoring case.

References:

  • https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.str.contains.html

  • https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.str.startswith.html

  • https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.str.endswith.html

(
df[['@timestamp','Image','CommandLine']]
# Keep only Sysmon process-creation events (EventID 1); the channel match ignores case
[(df['EventID'] == 1) & (df['Channel'].str.contains('sysmon',case = False, na = False, regex = False)) ]
.head(5)
)
      @timestamp                Image                                              CommandLine
661   2020-09-22T03:29:33.845Z  C:\Windows\System32\svchost.exe                    C:\windows\system32\svchost.exe -k appmodel -p...
1034  2020-09-22T03:30:11.221Z  C:\Program Files (x86)\Microsoft Office\root\O...  "C:\Program Files (x86)\Microsoft Office\root\...
1181  2020-09-22T03:30:11.292Z  C:\Program Files (x86)\Microsoft Office\root\O...  "C:\Program Files (x86)\Microsoft Office\Root\...
5019  2020-09-22T03:31:11.219Z  C:\Windows\System32\sc.exe                         "C:\windows\system32\sc.exe" \\WORKSTATION6 st...
5315  2020-09-22T03:31:41.475Z  C:\Windows\System32\sc.exe                         "C:\windows\system32\sc.exe" \\WORKSTATION6 qu...
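
The three string methods referenced above differ in where they look for the pattern and in how they treat case. The following sketch compares them on an invented Series of channel names (the values are hypothetical, including one missing entry) to show why contains with case = False is the most forgiving choice here:

```python
import numpy as np
import pandas as pd

# Hypothetical channel values, including a lowercase variant and a missing value
channels = pd.Series([
    'Microsoft-Windows-Sysmon/Operational',
    'security',
    'microsoft-windows-sysmon/operational',
    np.nan,
])

# Substring match anywhere in the string, ignoring case; na=False maps missing values to False
has_sysmon = channels.str.contains('sysmon', case=False, na=False, regex=False)

# startswith / endswith are case-sensitive and only match at the edges of the string
starts_ms = channels.str.startswith('Microsoft', na=False)
ends_op = channels.str.endswith('Operational', na=False)

print(has_sysmon.tolist())  # [True, False, True, False]
print(starts_ms.tolist())   # [True, False, False, False]
print(ends_op.tolist())     # [True, False, False, False]
```

Setting na = False matters because a boolean mask that contains missing values cannot be used directly to index a DataFrame.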

Calculating the CommandLine Length

We will use the assign method to add a new column to our DataFrame. This new column holds the length of the command line that the process used.

Reference: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.assign.html

(
df[['@timestamp','Image','CommandLine']]
[(df['EventID'] == 1) & (df['Channel'].str.contains('sysmon',case = False, na = False, regex = False))]
# str.len() is computed on the full df and aligned by index to the filtered rows
.assign(Command_Length = df['CommandLine'].str.len())
)
      @timestamp                Image                                              CommandLine                                        Command_Length
661   2020-09-22T03:29:33.845Z  C:\Windows\System32\svchost.exe                    C:\windows\system32\svchost.exe -k appmodel -p...  56.0
1034  2020-09-22T03:30:11.221Z  C:\Program Files (x86)\Microsoft Office\root\O...  "C:\Program Files (x86)\Microsoft Office\root\...  69.0
1181  2020-09-22T03:30:11.292Z  C:\Program Files (x86)\Microsoft Office\root\O...  "C:\Program Files (x86)\Microsoft Office\Root\...  80.0
5019  2020-09-22T03:31:11.219Z  C:\Windows\System32\sc.exe                         "C:\windows\system32\sc.exe" \\WORKSTATION6 st...  55.0
5315  2020-09-22T03:31:41.475Z  C:\Windows\System32\sc.exe                         "C:\windows\system32\sc.exe" \\WORKSTATION6 qu...  56.0
5552  2020-09-22T03:32:02.675Z  C:\Windows\System32\svchost.exe                    C:\windows\system32\svchost.exe -k netsvcs -p ...  55.0
5673  2020-09-22T03:32:02.741Z  C:\Windows\System32\sc.exe                         "C:\windows\system32\sc.exe" \\WORKSTATION6 st...  56.0
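
In the cell above, the length is computed on the full df and then aligned by index to the filtered rows. Because the full CommandLine column contains missing values, the result is a float column, which is why the lengths are displayed as 56.0 rather than 56. An alternative sketch, on invented data, passes a callable to assign so that the length is computed only on the rows that survived the filter. The frame and its values are hypothetical:

```python
import numpy as np
import pandas as pd

# Hypothetical events: the second one has no CommandLine and is not a process creation
events = pd.DataFrame({
    'EventID': [1, 13, 1],
    'Channel': ['Microsoft-Windows-Sysmon/Operational'] * 3,
    'CommandLine': ['sc.exe query', np.nan, 'whoami /all'],
})

is_proc_create = (events['EventID'] == 1) & events['Channel'].str.contains(
    'sysmon', case=False, na=False, regex=False
)

result = (
    events.loc[is_proc_create, ['CommandLine']]
    # The callable receives the already filtered frame, so no missing rows are involved
    .assign(Command_Length=lambda d: d['CommandLine'].str.len())
)

print(result)
```

Both forms give the same lengths for the filtered rows; the callable form simply avoids depending on the original df inside the chain.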

Thank you very much!! I hope this notebook has been useful for getting started with some techniques for transforming data :D

There is still more to learn :D