Análisis de Datos con Pandas - Filtrando y Resumiendo Datos

Representando el Archivo JSON como Dataframe usando Pandas

La representación del archivo JSON como un Dataframe de Pandas puede involucrar el uso de comandos como wget, unzip. Esto fue explicado con más detalle en el notebook que lleva por título Representando el Archivo JSON como Dataframe usando Pandas, y además se encuentra en la misma carpeta del presente notebook. Para mantener la estructura del presente notebok en un formato simple, los JSON files requeridos para este workshop ya han sido desargados y descomprimidos. Estos archivos se encuentran en la carpeta sets_datos.

psremoting_json = 'sets_datos/covenant_psremoting_command_2020-08-06115603.json'

a) Importando la librería Pandas

import pandas as pd

b) Leyendo Archivo JSON

Usaremos el método pandas.read_json.

Referencia: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_json.html

df = pd.read_json(path_or_buf = psremoting_json, lines = True)
df.head(5)
tags @version EventType Version ThreadID EventTime Task AccountType PipeName Channel ... ContextInfo Payload AdditionalInfo AdditionalInfo2 OperationType Properties MiniportNameLen MiniportName param3 param4
0 [mordorDataset] 1 ConnectPipe 1.0 4208 2020-08-06 11:56:04 18 User \lsass Microsoft-Windows-Sysmon/Operational ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1 [mordorDataset] 1 INFO 3.0 4208 2020-08-06 11:56:05 10 User NaN Microsoft-Windows-Sysmon/Operational ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2 [mordorDataset] 1 INFO 3.0 4208 2020-08-06 11:56:05 10 User NaN Microsoft-Windows-Sysmon/Operational ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
3 [mordorDataset] 1 AUDIT_SUCCESS 1.0 1808 2020-08-06 11:56:06 12810 NaN NaN Security ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
4 [mordorDataset] 1 AUDIT_SUCCESS 1.0 1808 2020-08-06 11:56:06 12810 NaN NaN Security ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN

5 rows × 180 columns

c) Conociendo las columnas o atributos del Dataframe

Usaremos el método pandas.DataFrame.info.

Referencia: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.info.html

df.info(verbose = True)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4284 entries, 0 to 4283
Data columns (total 180 columns):
 #   Column                     Dtype  
---  ------                     -----  
 0   tags                       object 
 1   @version                   int64  
 2   EventType                  object 
 3   Version                    float64
 4   ThreadID                   int64  
 5   EventTime                  object 
 6   Task                       int64  
 7   AccountType                object 
 8   PipeName                   object 
 9   Channel                    object 
 10  EventTypeOrignal           object 
 11  SourceName                 object 
 12  Hostname                   object 
 13  @timestamp                 object 
 14  OpcodeValue                float64
 15  Message                    object 
 16  SourceModuleName           object 
 17  ProcessId                  object 
 18  EventReceivedTime          object 
 19  port                       int64  
 20  AccountName                object 
 21  UtcTime                    object 
 22  Domain                     object 
 23  ExecutionProcessID         int64  
 24  host                       object 
 25  Severity                   object 
 26  SeverityValue              int64  
 27  EventID                    int64  
 28  UserID                     object 
 29  ProviderGuid               object 
 30  RecordNumber               int64  
 31  Image                      object 
 32  Keywords                   int64  
 33  ProcessGuid                object 
 34  SourceModuleType           object 
 35  RuleName                   object 
 36  TargetProcessGUID          object 
 37  SourceProcessGUID          object 
 38  CallTrace                  object 
 39  SourceImage                object 
 40  GrantedAccess              object 
 41  SourceProcessId            object 
 42  SourceThreadId             float64
 43  TargetProcessId            object 
 44  TargetImage                object 
 45  DestAddress                object 
 46  SourcePort                 float64
 47  Category                   object 
 48  Opcode                     object 
 49  Application                object 
 50  Direction                  object 
 51  RemoteMachineID            object 
 52  DestPort                   float64
 53  LayerRTID                  float64
 54  LayerName                  object 
 55  SourceAddress              object 
 56  FilterRTID                 float64
 57  Protocol                   object 
 58  RemoteUserID               object 
 59  TargetObject               object 
 60  SubjectLogonId             object 
 61  PrivilegeList              object 
 62  SubjectDomainName          object 
 63  SubjectUserSid             object 
 64  SubjectUserName            object 
 65  TargetDomainName           object 
 66  ElevatedToken              object 
 67  TargetOutboundDomainName   object 
 68  IpPort                     object 
 69  ImpersonationLevel         object 
 70  ProcessName                object 
 71  KeyLength                  float64
 72  TargetOutboundUserName     object 
 73  RestrictedAdminMode        object 
 74  LogonProcessName           object 
 75  LogonGuid                  object 
 76  WorkstationName            object 
 77  TargetLinkedLogonId        object 
 78  TargetLogonId              object 
 79  TargetUserSid              object 
 80  TransmittedServices        object 
 81  IpAddress                  object 
 82  LmPackageName              object 
 83  TargetUserName             object 
 84  AuthenticationPackageName  object 
 85  VirtualAccount             object 
 86  LogonType                  float64
 87  GroupMembership            object 
 88  EventIdx                   float64
 89  EventCountTotal            float64
 90  DestinationPortName        object 
 91  DestinationIp              object 
 92  SourceIp                   object 
 93  Initiated                  object 
 94  DestinationHostname        object 
 95  SourceHostname             object 
 96  DestinationPort            float64
 97  SourceIsIpv6               object 
 98  User                       object 
 99  DestinationIsIpv6          object 
 100 SourcePortName             object 
 101 ActivityID                 object 
 102 ObjectType                 object 
 103 AccessMask                 object 
 104 AccessList                 object 
 105 ShareName                  object 
 106 Device                     object 
 107 EnabledPrivilegeList       object 
 108 DisabledPrivilegeList      object 
 109 ShareLocalPath             object 
 110 AccessReason               object 
 111 RelativeTargetName         object 
 112 Details                    object 
 113 OriginalFileName           object 
 114 ParentCommandLine          object 
 115 ParentProcessId            float64
 116 FileVersion                object 
 117 Company                    object 
 118 CurrentDirectory           object 
 119 IntegrityLevel             object 
 120 CommandLine                object 
 121 ParentImage                object 
 122 LogonId                    object 
 123 Hashes                     object 
 124 Description                object 
 125 TerminalSessionId          float64
 126 ParentProcessGuid          object 
 127 Product                    object 
 128 Signed                     object 
 129 SignatureStatus            object 
 130 Signature                  object 
 131 ImageLoaded                object 
 132 ObjectName                 object 
 133 ResourceAttributes         object 
 134 TransactionId              object 
 135 RestrictedSidCount         float64
 136 ObjectServer               object 
 137 HandleId                   object 
 138 SourceHandleId             object 
 139 TargetHandleId             object 
 140 NewProcessName             object 
 141 NewProcessId               object 
 142 TokenElevationType         object 
 143 MandatoryLabel             object 
 144 ParentProcessName          object 
 145 Service                    object 
 146 NewSd                      object 
 147 OldSd                      object 
 148 TargetFilename             object 
 149 CreationUtcTime            object 
 150 Path                       object 
 151 Priority                   float64
 152 TaskName                   object 
 153 Status                     object 
 154 IsExecutable               object 
 155 Archived                   object 
 156 QueryResults               object 
 157 QueryName                  object 
 158 QueryStatus                float64
 159 TargetServerName           object 
 160 TargetLogonGuid            object 
 161 TargetInfo                 object 
 162 TicketEncryptionType       object 
 163 ServiceName                object 
 164 ServiceSid                 object 
 165 PreAuthType                float64
 166 TargetSid                  object 
 167 TicketOptions              object 
 168 param1                     object 
 169 param2                     object 
 170 ContextInfo                object 
 171 Payload                    object 
 172 AdditionalInfo             object 
 173 AdditionalInfo2            object 
 174 OperationType              object 
 175 Properties                 object 
 176 MiniportNameLen            float64
 177 MiniportName               object 
 178 param3                     object 
 179 param4                     object 
dtypes: float64(19), int64(9), object(152)
memory usage: 5.9+ MB

Filtrando Columnas o Atributos de nuestro Dataframe

Seleccionando las columnas ‘@timestamp’,’Hostname’,’Channel’,’EventID’ usando una lista con los nombres de las columnas.

df[['@timestamp','Hostname','Channel','ParentImage','Image','EventID']].head()
@timestamp Hostname Channel ParentImage Image EventID
0 2020-08-06T15:56:07.158Z MORDORDC.theshire.local Microsoft-Windows-Sysmon/Operational NaN C:\windows\system32\dns.exe 18
1 2020-08-06T15:56:07.158Z MORDORDC.theshire.local Microsoft-Windows-Sysmon/Operational NaN NaN 10
2 2020-08-06T15:56:07.158Z MORDORDC.theshire.local Microsoft-Windows-Sysmon/Operational NaN NaN 10
3 2020-08-06T15:56:08.170Z WORKSTATION5.theshire.local Security NaN NaN 5156
4 2020-08-06T15:56:08.170Z WORKSTATION5.theshire.local Security NaN NaN 5156

Filtrando Filas o Registros de nuestro Dataframe

a) Una condición

Filtrando nombres de procesos que incluyan el string wsmprovhost.exe.

(
df[['@timestamp','Hostname','Channel','ParentImage','Image','EventID']]
    
[df['Image'].str.contains('wsmprovhost.exe',case = False, na = False, regex = False)]
    
.head(5)
)
@timestamp Hostname Channel ParentImage Image EventID
656 2020-08-06T15:56:24.416Z WORKSTATION6.theshire.local Microsoft-Windows-Sysmon/Operational C:\Windows\System32\svchost.exe C:\Windows\System32\wsmprovhost.exe 1
666 2020-08-06T15:56:24.419Z WORKSTATION6.theshire.local Microsoft-Windows-Sysmon/Operational NaN C:\Windows\System32\wsmprovhost.exe 7
669 2020-08-06T15:56:24.420Z WORKSTATION6.theshire.local Microsoft-Windows-Sysmon/Operational NaN C:\Windows\System32\wsmprovhost.exe 7
673 2020-08-06T15:56:24.423Z WORKSTATION6.theshire.local Microsoft-Windows-Sysmon/Operational NaN C:\Windows\System32\wsmprovhost.exe 7
675 2020-08-06T15:56:24.424Z WORKSTATION6.theshire.local Microsoft-Windows-Sysmon/Operational NaN C:\Windows\System32\wsmprovhost.exe 7

b) Más de una condición: Operadores AND y OR

Podemos usar múltiples condiciones usando los operadores & (AND) y | (OR). El uso de parentesis es importante cuando trabajamos con múltiples condiciones.

(
df[['@timestamp','Hostname','Channel','ParentImage','Image','EventID']]
    
[(df['Image'].str.contains('wsmprovhost.exe',case = False, na = False, regex = False)) |
  (df['ParentImage'].str.contains('wsmprovhost.exe',case = False, na = False, regex = False))]    

.head(5)
)
@timestamp Hostname Channel ParentImage Image EventID
656 2020-08-06T15:56:24.416Z WORKSTATION6.theshire.local Microsoft-Windows-Sysmon/Operational C:\Windows\System32\svchost.exe C:\Windows\System32\wsmprovhost.exe 1
666 2020-08-06T15:56:24.419Z WORKSTATION6.theshire.local Microsoft-Windows-Sysmon/Operational NaN C:\Windows\System32\wsmprovhost.exe 7
669 2020-08-06T15:56:24.420Z WORKSTATION6.theshire.local Microsoft-Windows-Sysmon/Operational NaN C:\Windows\System32\wsmprovhost.exe 7
673 2020-08-06T15:56:24.423Z WORKSTATION6.theshire.local Microsoft-Windows-Sysmon/Operational NaN C:\Windows\System32\wsmprovhost.exe 7
675 2020-08-06T15:56:24.424Z WORKSTATION6.theshire.local Microsoft-Windows-Sysmon/Operational NaN C:\Windows\System32\wsmprovhost.exe 7

Resumiendo Filas o Registros de nuestro Dataframe

a) Resumiendo los eventos de seguridad para el proveedor Sysmon

Ahora podemos realizar la agrupación del dataframe anterior usando el método groupby y la columna que representa el número de identificaión del evento de seguridad.

Referencia: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.groupby.html

(
df[['@timestamp','Hostname','Channel','ParentImage','Image','EventID']]
    
[(df['Image'].str.contains('wsmprovhost.exe',case = False, na = False, regex = False)) |
  (df['ParentImage'].str.contains('wsmprovhost.exe',case = False, na = False, regex = False))]    

.groupby(['EventID']).size()
)
EventID
1       1
5       1
7      91
11      3
12    148
17      1
18      1
23      2
dtype: int64

El código ejecutado previamente nos devuelve una Serie. En caso quisieramos convertir este objeto a un dataframe, podemos usar el método to_frame.

Referencia: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.to_frame.html

(
df[['@timestamp','Hostname','Channel','ParentImage','Image','EventID']]
    
[(df['Image'].str.contains('wsmprovhost.exe',case = False, na = False, regex = False)) |
  (df['ParentImage'].str.contains('wsmprovhost.exe',case = False, na = False, regex = False))]    

.groupby(['EventID']).size().to_frame(name = 'Frequencia')
)
Frequencia
EventID
1 1
5 1
7 91
11 3
12 148
17 1
18 1
23 2

b) Resumiendo y Ordenando los eventos de seguridad para el proveedor Sysmon

Similar al codigo anterior, pero ahora vamos a agregar la operacion de ordenamiento usando el método sort_values.

Referencia: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.sort_values.html

(
df[['@timestamp','Hostname','Channel','ParentImage','Image','EventID']]
    
[(df['Image'].str.contains('wsmprovhost.exe',case = False, na = False, regex = False)) |
  (df['ParentImage'].str.contains('wsmprovhost.exe',case = False, na = False, regex = False))]    

.groupby(['EventID']).size().to_frame(name = 'Frequencia').sort_values(by = 'Frequencia', ascending = False)
)
Frequencia
EventID
12 148
7 91
11 3
23 2
1 1
5 1
17 1
18 1

Muchas gracias!! Espero que este notebooks haya sido útil para empezar a revisar algunas técnicas para filtrar y resumir datos :D

Aún hay más por aprender :D