Background

I have a master's degree and more than three years of experience in the data science field. After graduation, I first joined a health tech startup, where I worked for two years. My responsibilities were to study user behavior, the market, content, and new features, and to bring data and data insights into every decision the company made.

I joined JPMorgan Chase last May. Our team's goal is to protect the client, the market, and the public from market abuses such as insider trading, piggybacking, and so on. Currently we use a vendor product to generate rule-based alerts on those market abuses. As you can imagine, the rule-based system is not very precise, and we receive many false alerts. Since we only want to raise alerts for true issues, my responsibility is to apply advanced analysis and machine learning models to improve alert accuracy. In my daily work I use Python as the primary programming language, and I'm starting to use Scala as well for Spark jobs. I work with traditional relational databases as well as big data infrastructure: Hadoop HDFS, Hive, and Impala.

I have accomplished several projects here. The first one is a company name matching project. There are plenty of use cases where we need to match security or company records based on company names rather than on more structured identifiers such as tickers, CUSIPs, or ISINs. One such use case is Insider Trading Affiliate Matching, where we match the company name a client writes in their KYC to the list of securities we currently have in the database. The matches we get from that are used in the Insider Trading Review to generate an alert when a trade is made on a product that matches a client's affiliation.

There are two major components in this project. First, we generated a reference table. This step involved many ETL tasks to load all the company and product names from tables in different schemas. We also did some text mining (identifying suffixes) and text cleaning (removing punctuation and suffixes) in this step to generate all the possible synonyms for each company. We saved this reference table so that we can reuse it for other use cases. The second component is the name matching itself. Since fuzzy matching is computationally expensive, we do an exact match first to minimize the need for fuzzy matching; a sketch of this flow is below. For the fuzzy match, I built a binary classification model based on multiple string similarity scores and string features. With the refined fuzzy match algorithm and the reference table, the alert accuracy increased by 15%.
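Here is a minimal sketch of that exact-then-fuzzy flow in Python. The rapidfuzz similarity functions, the suffix list, and the 0.5 decision threshold are illustrative assumptions on my part, not the exact production implementation; `model` stands in for the binary classifier trained on labeled name pairs.

```python
# Sketch of the exact-then-fuzzy matching flow. Library choice (rapidfuzz)
# and the cleaning rules are illustrative assumptions.
import re

from rapidfuzz import fuzz

SUFFIXES = (" inc", " corp", " ltd", " llc")  # example suffixes mined from the data

def clean(name: str) -> str:
    """Normalize a company name: lowercase, drop punctuation and common suffixes."""
    name = re.sub(r"[^\w\s]", "", name.lower()).strip()
    for suffix in SUFFIXES:
        if name.endswith(suffix):
            name = name[: -len(suffix)].strip()
    return name

def similarity_features(a: str, b: str) -> list[float]:
    """String similarity scores and features fed to the binary classifier."""
    return [
        fuzz.ratio(a, b),             # whole-string edit similarity
        fuzz.token_sort_ratio(a, b),  # ignores word order
        fuzz.partial_ratio(a, b),     # best-matching substring
        abs(len(a) - len(b)),         # length difference
    ]

def match(query: str, reference: dict[str, str], model) -> str | None:
    """Exact match against the reference table first; fall back to the
    classifier (any fitted model with predict_proba) only when needed."""
    q = clean(query)
    if q in reference:  # cheap exact match covers most records
        return reference[q]
    best_name, best_score = None, 0.0
    for synonym, canonical in reference.items():
        score = model.predict_proba([similarity_features(q, synonym)])[0][1]
        if score > best_score:
            best_name, best_score = canonical, score
    return best_name if best_score >= 0.5 else None  # illustrative threshold
```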

We also completed a POC on automating decision making for Trust Discretionary cash distribution requests. This was a cross-team project. A trust distribution request needs to go through several levels of approval by different people, so a client may wait anywhere from a couple of weeks to a month for approval. We wanted to improve the efficiency of this decision-making process. After understanding the business, we looked into the data. The most important information in a request is the purpose of the requested cash, which is in free-text format. The trust officer reads through the contract and other documents of the trust to see if the request is aligned with the documentation, so we applied NLP techniques to measure document similarity. Since that is not the only factor affecting the decision, we also included other structured data, such as the amount requested and the request frequency. Based on these features, we built a classification model to predict whether a request should be denied or approved; a sketch of the similarity feature is below. The accuracy of the initial model is above 80%, so it was a successful POC. We are now productionizing the model and applying for a patent on this solution.
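As one way to picture the document-similarity feature: the snippet below uses TF-IDF vectors and cosine similarity, which is an assumption about a reasonable implementation rather than the exact technique we shipped.

```python
# A minimal sketch of the document-similarity feature, assuming TF-IDF
# and cosine similarity; the actual NLP techniques may differ.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def purpose_similarity(request_purpose: str, trust_documents: list[str]) -> float:
    """Score how well the free-text purpose aligns with the trust's documents."""
    vectorizer = TfidfVectorizer(stop_words="english")
    matrix = vectorizer.fit_transform([request_purpose] + trust_documents)
    # Compare the request (row 0) against each document and keep the best match.
    sims = cosine_similarity(matrix[0:1], matrix[1:])
    return float(sims.max())

# The similarity score is then combined with structured features as model
# input, e.g.: [purpose_similarity(req.purpose, docs), req.amount, req.frequency]
```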

Another project I accomplished is an automatic email notification tool based on analysis of JIRA data. It provides the team with actionable recommendations on sprint planning; a rough sketch of the flow is below.
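A rough sketch of what such a tool can look like, assuming the standard JIRA REST search endpoint; the hosts, addresses, and JQL query are placeholders, and the actual analysis and recommendation logic were more involved.

```python
# Hypothetical notification flow: pull sprint issues from the JIRA REST API,
# compute a simple metric, and email a recommendation.
import smtplib
from email.message import EmailMessage

import requests

JIRA_URL = "https://jira.example.com"  # placeholder

def open_issue_count(sprint: str, auth: tuple[str, str]) -> int:
    """Count unresolved issues in the given sprint via the JIRA search API."""
    resp = requests.get(
        f"{JIRA_URL}/rest/api/2/search",
        params={"jql": f'sprint = "{sprint}" AND resolution = Unresolved',
                "maxResults": 0},  # we only need the total count
        auth=auth,
    )
    resp.raise_for_status()
    return resp.json()["total"]

def send_recommendation(to_addr: str, body: str) -> None:
    """Email the generated sprint-planning recommendation to the team."""
    msg = EmailMessage()
    msg["Subject"] = "Sprint planning recommendation"
    msg["From"] = "bot@example.com"    # placeholder sender
    msg["To"] = to_addr
    msg.set_content(body)
    with smtplib.SMTP("smtp.example.com") as smtp:  # placeholder SMTP host
        smtp.send_message(msg)
```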

Now I'm working on building a graph database and applying network analysis to detect insider trading. Insider trading happens when inside information passes between related persons. Today, our trade surveillance analysts resort to Googling to find people who might have inside information and the relationships between them. We want them to be able to find all the information they need in one platform, so we are building a graph database that will include all the relationships in our data.

The first type of relationship we are adding to the graph database is company ownership. I developed a data ingestion pipeline to extract the list of insider names from SEC filings; here, insiders means people who are involved in a company's operations or who own more than 10% of the company's shares. Some of these insiders are our clients. Even when they are not, they may still have some relationship with our clients; for example, they might be an authorized contact on a client's account, which is another type of relationship, and we have that data in another system. This is just the first step: gather all the relationships we can find in open-source and internal systems. Next we will build an interactive network visualization front end so our analysts can investigate the relationships they are interested in, and eventually we want to apply network analysis to provide more insights to our users. A small sketch of the kind of query this graph enables is below.
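To illustrate the kind of question the graph answers, here is a toy example using networkx as a stand-in for the actual graph database; the names and relationship types are made up for illustration.

```python
# Toy relationship graph: people and companies as nodes, relationships as edges.
import networkx as nx

G = nx.Graph()

# Insider relationships extracted from SEC filings (officers or >10% owners).
G.add_edge("Alice", "Acme Corp", relation="insider_of")
G.add_edge("Bob", "Acme Corp", relation="insider_of")

# Relationships from internal systems, e.g. authorized account contacts.
G.add_edge("Bob", "Carol", relation="authorized_contact")

# An analyst question: is there a chain linking a trader to a company
# through which inside information could have passed?
for path in nx.all_simple_paths(G, source="Carol", target="Acme Corp", cutoff=3):
    print(" -> ".join(path))  # Carol -> Bob -> Acme Corp
```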

 
